Commit: 06a347b7
Scope: products/sandbox/evals/scenarios/tasks.ts, products/sandbox/evals/src/helpers.test.ts, products/sandbox/evals/src/helpers.ts, products/sandbox/evals/src/record.test.ts, products/sandbox/evals/src/record.ts, products/sandbox/evals/src/runner/index.ts, products/sandbox/sdk/src/client.ts, products/sandbox/sdk/src/collaboration/client.ts…
codex — ❌ FAILED
OpenAI Codex v0.122.0 (research preview)
workdir: /home/drew/company/tools/pr-reviewer/.webhook-state/checkouts/tangle-network-agent-dev-container-pr-819
model: gpt-5.3-codex
provider: openai
approval: never
sandbox: workspace-write [workdir, /tmp, $TMPDIR, /home/drew/.codex/memories]
reasoning effort: none
reasoning summaries: none
session id: 019db97d-f5ea-7ae0-80bf-f86e3db7a29d
user
You are a senior engineer reviewing one assigned track of a code change. You are thorough but you respect the reader's time — every finding you report should be worth someone stopping to look at.
You own only your assigned track. Stay in scope. Do not report on files or concerns outside your track's scope unless you find something genuinely dangerous.
Read the evidence in your track scope. Think about what could go wrong in production, what would break silently, what a test wouldn't catch. Look at error paths, edge cases, and implicit assumptions.
Use sub-agents when it helps — for example, to check test coverage while you're reviewing implementation logic. But don't use them just because you can.
A good finding names a specific file and line, explains what's wrong, and shows evidence. "This nil check on line 47 of auth.go doesn't cover the case where the token is expired but structurally valid — the downstream handler will panic" is a finding. "Consider adding error handling" is not.
Ask yourself: would a staff engineer read this finding and say "good catch" or "obvious filler"? Only report the former.
Do not report:
- Findings about style, naming, or formatting unless they create actual confusion
- "Consider adding tests for X" without explaining what specific behavior is untested and why it matters
- Speculative findings without code evidence — if you can't point to the line, don't report it
- Duplicating findings that clearly belong to another track
Be honest about what you're sure about versus what you suspect. Flag uncertainty in confidence_notes rather than inflating finding confidence.
Return JSON only, no markdown fences.
{
"status": "ok|error",
"track_id": "from the assigned track",
"summary": "what you found, in one paragraph",
"findings": [
{
"severity": "high|medium|low",
"confidence": "high|medium|low",
"category": "correctness|security|regression|testing|operational",
"title": "short, specific title",
"body": "what's wrong, why it matters, what evidence you found",
"file": "path/to/file",
"line": 0,
"evidence": "the actual code or behavior that demonstrates the issue"
}
],
"questions": ["things you want to ask the author — genuine questions, not passive-aggressive suggestions"],
"confidence_notes": ["where you're uncertain and why"]
}
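A minimal sketch of how a harness might check a reviewer reply against the shape requested above. The type names and the `parseTrackReview` helper are hypothetical and are not taken from the pr-reviewer tool; the sketch only assumes the field names and enum values listed in the prompt.

```typescript
// Hypothetical types mirroring the JSON shape the prompt asks for.
type Level = "high" | "medium" | "low";
type Category = "correctness" | "security" | "regression" | "testing" | "operational";

interface Finding {
  severity: Level;
  confidence: Level;
  category: Category;
  title: string;
  body: string;
  file: string;
  line: number;
  evidence: string;
}

interface TrackReview {
  status: "ok" | "error";
  track_id: string;
  summary: string;
  findings: Finding[];
  questions: string[];
  confidence_notes: string[];
}

const LEVELS: readonly string[] = ["high", "medium", "low"];
const CATEGORIES: readonly string[] = ["correctness", "security", "regression", "testing", "operational"];

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

function isStringArray(v: unknown): v is string[] {
  return Array.isArray(v) && v.every((x) => typeof x === "string");
}

function isFinding(v: unknown): v is Finding {
  if (!isRecord(v)) return false;
  return (
    typeof v.severity === "string" && LEVELS.includes(v.severity) &&
    typeof v.confidence === "string" && LEVELS.includes(v.confidence) &&
    typeof v.category === "string" && CATEGORIES.includes(v.category) &&
    typeof v.title === "string" &&
    typeof v.body === "string" &&
    typeof v.file === "string" &&
    typeof v.evidence === "string" &&
    typeof v.line === "number" && Number.isInteger(v.line) && v.line >= 0
  );
}

// Returns the parsed review, or null when the reply is not bare JSON of the expected shape.
function parseTrackReview(raw: string): TrackReview | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // e.g. the reply carried prose or markdown fences around the JSON
  }
  if (!isRecord(data)) return null;
  const shapeOk =
    (data.status === "ok" || data.status === "error") &&
    typeof data.track_id === "string" &&
    typeof data.summary === "string" &&
    Array.isArray(data.findings) && data.findings.every(isFinding) &&
    isStringArray(data.questions) &&
    isStringArray(data.confidence_notes);
  return shapeOk ? (data as unknown as TrackReview) : null;
}
```

Returning `null` instead of throwing lets a caller treat a malformed reply as a failed track without a separate error path; whether the real tool does this is not shown in this log.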
Based on 175 reviews with 1002 findings:
- sidecar: 89 findings historically
- container: 79 findings historically
- no test: 35 findings historically
- auth: 33 findings historically
- config: 33 findings historically
- timeout: 28 findings historically
- token: 25 findings historically
- leak: 23 findings historically
- lock: 20 findings historically
- test coverage: 17 findings historically
- [high] security: Docker runtime missing reserved env var blocking (SIDECAR_AUTH_DISABLED bypass)
- [high] correctness: removeContainerFromDriver swallows errors, leaks running containers
- [high] regression: AgentExecutionInput.message changed from required to optional -- breaks adapters
- [high] regression: TraceEventInput removes tool_call, tool_result, llm_request, llm_response event types
- [high] security: Docker runtime accepts arbitrary host bind mounts from request body
Prioritize findings in these known-weak areas. Do not waste review cycles on patterns outside these categories unless you find a genuine critical/high issue.
{ "changed_files": [ ".evolve/current.json", ".evolve/progress.md", ".evolve/pursuits/2026-04-22-adc-runtime-validation-loop.md", ".orchestrator/sandbox-benchmark-records.jsonl", "apps/host-agent/src/config.ts", "apps/host-agent/src/routes/runtime.ts", "apps/host-agent/src/server.ts", "apps/host-agent/tests/unit/runtime-app.test.ts", "apps/orchestrator/scripts/benchmark-startup-stream.ts", "apps/orchestrator/src/constants.ts", "apps/orchestrator/src/driver/docker.ts", "apps/orchestrator/src/egress/ARCHITECTURE.md", "apps/orchestrator/src/orchestrator/index.ts", "apps/orchestrator/src/orchestrator/project-manager.ts", "apps/orchestrator/src/orchestrator/sidecar-manager.ts", "apps/orchestrator/src/routes/projects.ts", "apps/orchestrator/src/services/snapshot-job-queue.ts", "apps/orchestrator/src/services/snapshot-service.ts", "apps/orchestrator/tests/unit/constants.test.ts", "apps/orchestrator/tests/unit/docker-driver-envs.test.ts", "apps/orchestrator/tests/unit/docker-driver-startsidecar.test.ts", "apps/orchestrator/tests/unit/orchestrator/project-manager.test.ts", "apps/sidecar/docker/Dockerfile", "apps/sidecar/src/constants.ts", "apps/sidecar/src/index.ts", "apps/sidecar/src/lib/token-blocklist.ts", "apps/sidecar/src/middleware/auth.ts", "apps/sidecar/src/routes/debug.ts", "apps/sidecar/tests/unit/debug-route.test.ts", "apps/sidecar/tests/unit/identity-headers.test.ts", "apps/sidecar/tests/unit/lib/token-blocklist.test.ts", "apps/sidecar/tests/unit/middleware/auth-jwt.test.ts", "packages/sdk-core/src/auth/tokens.ts", "packages/sdk-core/tests/auth/sidecar-tokens.test.ts", "packages/sdk-provider-opencode/src/server.ts", "packages/sdk-provider-opencode/tests/server-process-user.test.ts", "packages/shared/src/egress-types.ts", "products/sandbox/api/src/routes/sandboxes.ts", "products/sandbox/evals/scenarios/agent-redteam.ts", "products/sandbox/evals/scenarios/agent.ts", "products/sandbox/evals/scenarios/devcontainers.ts", 
"products/sandbox/evals/scenarios/direct-api-e2e.ts", "products/sandbox/evals/scenarios/direct-api.ts", "products/sandbox/evals/scenarios/driver-matrix.ts", "products/sandbox/evals/scenarios/infra-security.ts", "products/sandbox/evals/scenarios/lifecycle.ts", "products/sandbox/evals/scenarios/pentest-abuse.ts", "products/sandbox/evals/scenarios/pentest-adversarial.ts", "products/sandbox/evals/scenarios/pentest-compliance.ts", "products/sandbox/evals/scenarios/pentest-control-plane.ts", "products/sandbox/evals/scenarios/pentest-jwt-auth.ts", "products/sandbox/evals/scenarios/pentest-nation-state.ts", "products/sandbox/evals/scenarios/pentest-redteam-ai.ts", "products/sandbox/evals/scenarios/platform-redteam.ts", "products/sandbox/evals/scenarios/resilience.ts", "products/sandbox/evals/scenarios/sdk-dx.ts", "products/sandbox/evals/scenarios/security-boundaries.ts", "products/sandbox/evals/scenarios/tasks.ts", "products/sandbox/evals/src/helpers.test.ts", "products/sandbox/evals/src/helpers.ts", "products/sandbox/evals/src/record.test.ts", "products/sandbox/evals/src/record.ts", "products/sandbox/evals/src/runner/index.ts", "products/sandbox/sdk/src/client.ts", "products/sandbox/sdk/src/collaboration/client.ts", "products/sandbox/sdk/src/errors.ts", "products/sandbox/sdk/src/orchestrator.ts", "products/sandbox/sdk/src/sandbox.ts", "products/sandbox/sdk/tests/e2e/snapshot-lifecycle.test.ts", "products/sandbox/sdk/tests/e2e/task-execution.test.ts", "products/sandbox/sdk/tests/helpers/orchestrator-server.ts", "products/sandbox/sdk/tests/helpers/test-storage-agent.ts", "products/sandbox/sdk/tests/setup.ts", "products/sandbox/sdk/tests/unit/errors.test.ts" ], "pr": 819, "repo": "tangle-network/agent-dev-container", "track": { "evidence_targets": [ "products/sandbox/evals/scenarios/tasks.ts", "products/sandbox/evals/src/helpers.test.ts", "products/sandbox/evals/src/helpers.ts", "products/sandbox/evals/src/record.test.ts", "products/sandbox/evals/src/record.ts", 
"products/sandbox/evals/src/runner/index.ts", "products/sandbox/sdk/src/client.ts", "products/sandbox/sdk/src/collaboration/client.ts", "products/sandbox/sdk/src/errors.ts", "products/sandbox/sdk/src/orchestrator.ts", "products/sandbox/sdk/src/sandbox.ts", "products/sandbox/sdk/tests/e2e/snapshot-lifecycle.test.ts", "products/sandbox/sdk/tests/e2e/task-execution.test.ts", "products/sandbox/sdk/tests/helpers/orchestrator-server.ts", "products/sandbox/sdk/tests/helpers/test-storage-agent.ts", "products/sandbox/sdk/tests/setup.ts", "products/sandbox/sdk/tests/unit/errors.test.ts" ], "goal": "Audit changed files for correctness, security, tests, and maintainability.", "scope": [ "products/sandbox/evals/scenarios/tasks.ts", "products/sandbox/evals/src/helpers.test.ts", "products/sandbox/evals/src/helpers.ts", "products/sandbox/evals/src/record.test.ts", "products/sandbox/evals/src/record.ts", "products/sandbox/evals/src/runner/index.ts", "products/sandbox/sdk/src/client.ts", "products/sandbox/sdk/src/collaboration/client.ts", "products/sandbox/sdk/src/errors.ts", "products/sandbox/sdk/src/orchestrator.ts", "products/sandbox/sdk/src/sandbox.ts", "products/sandbox/sdk/tests/e2e/snapshot-lifecycle.test.ts", "products/sandbox/sdk/tests/e2e/task-execution.test.ts", "products/sandbox/sdk/tests/helpers/orchestrator-server.ts", "products/sandbox/sdk/tests/helpers/test-storage-agent.ts", "products/sandbox/sdk/tests/setup.ts", "products/sandbox/sdk/tests/unit/errors.test.ts" ], "should_use_subagents": true, "suggested_provider": "", "track_id": "track-04" } }
diff --git a/.evolve/current.json b/.evolve/current.json
index 86f3fe3..222ca32 100644
--- a/.evolve/current.json
+++ b/.evolve/current.json
@@ -1,47 +1,28 @@
{
"mode": "pursue",
- "goal": "close all pre-existing errors + harden Nix + prevent stale-dist recurrence (Gen 4)",
- "status": "gen4_shipped",
+ "goal": "validate and improve ADC's six must-win runtime claims: speed, correctness, reliability, security, statefulness, reproducibility",
+ "status": "generation 1 reliability/security tranche advanced; streamed benchmark path exposed a remaining first-turn runtime defect",
"round": 1,
- "generation": 4,
- "activePursuit": ".evolve/pursuits/2026-04-19-gen4-delete-mocks-harden-real.md",
- "branch": "feat/agents-mega-pr",
- "pr": "https://github.com/tangle-network/agent-dev-container/pull/722",
+ "generation": 1,
+ "activePursuit": ".evolve/pursuits/2026-04-22-adc-runtime-validation-loop.md",
+ "branch": "develop",
"metrics": {
- "orchestratorUnitPass": "1769 / 1769 (0 failures, was 1744 / 1805 with 61 failures)",
- "sidecarTestsPass": 998,
- "sidecarTestsSkipped": 2,
- "registryTestsPass": 18,
- "forgeTestsPass": 40,
- "acpTestsPass": 45,
- "pureMockTestingMockFilesDeleted": 6,
- "pureMockLinesDeleted": 1500,
- "testFilesFixedByRealRootCause": 10,
- "staleDistCanaryPackages": 4,
- "nixStaleTodoCommentsRemoved": 1,
- "nixFakeHashes": 0,
- "phase15Gate": "BLOCK-CONVERTED-TO-SHIP-WITH-CONDITIONS (3 blockers surfaced + resolved)",
- "phase35DiffAudit": "SHIP-WITH-FIXES (0 CRIT, 0 HIGH, 2/3 MED applied, 1 LOW applied as defense-in-depth)"
+ "provisioningSpeed": "baseline pass: ttfx ~2.86s, contention 1.14x",
+ "runtimeCorrectness": "validated green locally: direct-api 17/17, sdk-dx 3/3; stale sidecar image and direct-mode API mapping defects fixed; firecracker still unreachable locally",
+ "sandboxReliability": "improved locally: stop/resume persistence and devcontainer cold-build path green; snapshot lifecycle now fails explicitly with 501 when backend is unavailable; streamed benchmark still exposes an OpenCode first-turn readiness defect",
+ "security": "improved locally: direct runtime security-boundary probes green for read-only nix mount and container-escape denial; full agentic redteam pack is still long-running on the first control-plane scenario",
+ "statefulness": "validated green locally: direct-api stop/resume persistence proved end-to-end; snapshot unsupported path now explicit instead of opaque 500",
+ "reproducibility": "benchmark harness now records session creation and 1st/2nd/3rd/5th output-event timings, but the streamed first-turn path is not yet green enough to claim reproducibility"
},
- "gen4Shipped": [
- "Mock factory PangolinCtor rewritten: real function with mutable __impl slot — vitest@4 `new PangolinClient()` now returns the test's programmed object, not the mock fn prototype",
- "6 pure mock-testing-mock test files deleted (~1,500 lines): mock-infrastructure, mock-completeness, mock-fix-verification, pangolin-mock-controls, rate-limiting-mock, test-isolation-basic",
- "10 orchestrator unit test files fixed against real root cause: docker-driver-envs, docker-driver-preserve-workspace, pangolin-lifecycle/simple/preprovision, tangle/client-runtime, server-bootstrap, error-handling, projects-snapshots-route, vitest.config.ts @repo/shared alias",
- "Error-handling test expectations updated to match post-#463 real code behavior (retry paths, graceful null-metadata degrade, invalid-protocol guard); null explicitly re-added per Phase 3.5 audit",
- "scripts/check-dist-freshness.mjs — new invariant catches the Gen 2 (storage/dist stale) + evolve R1 (sdk-core/dist stale) failure pattern at `pnpm run check:invariants` time. Covers 4 shared packages with canary exports.",
- "Nix stale TODO(gen-3 nix-acp) comment removed from agent-clis.nix — Gen 3 already shipped mkNpmAgentWithDeps",
- "package.json check:invariants wired to include new canary invariant"
- ],
- "gen4Deferred": [
- "Phase 3.5 LOW findings: empty-string protocol in invalidProtocols (already covered under the bundled fix); alternate corrupted-metadata shapes (pangolin: 'garbage' / pangolin: []) — defensive branch exists in preview-link-service.ts:1228; worth a follow-up assertion, not a blocker",
- "Audit flagged globalThis.__mockDocker pattern as fragile if vitest config ever flips to pool:forks + isolate:false. Seed for Gen 5 if pool changes."
- ],
- "productValueClaim": "Zero failing orchestrator unit tests = every CI run produces signal, not noise. Stale-dist invariant = the Gen 2 + evolve R1 pattern (silent typecheck-against-stale-compiled-artifact) is now caught at `pnpm run check:invariants` time, not during a PR typecheck.",
- "nextMove": "PR #722 has no local blockers remaining. Real-infra matrix on drew-gtr-pro is the last gate (evolve R1 unblocked it for all 10 backends). After that merges, Gen 5 targets: (a) apply globalThis → closure-captured mock pattern uniformly across orchestrator tests for parallelism safety; (b) extend stale-dist canaries to every shared workspace package; (c) the bundled acp/openclaw Dockerfile fallback block deletes on the commit after #722 merges per Phase 1.5 C6.",
- "previousGeneration": {
- "generation": 3,
- "status": "shipped",
- "completedAt": "2026-04-19T21:05:00Z"
+ "metricClaims": {
+ "provisioningSpeed": "If startup/resume latency drops, the user reaches a live agent faster and abandons fewer sessions during setup.",
+ "runtimeCorrectness": "If real driver and SDK flows pass, any supported CLI or agent harness can actually run instead of failing after provisioning.",
+ "sandboxReliability": "If lifecycle, recovery, and snapshot flows survive failures, long-running agent work stops losing state or hanging on infra edges.",
+ "security": "If redteam and pentest scenarios pass, hostile code and control-plane misuse are blocked before they become customer incidents.",
+ "statefulness": "If stop/resume/snapshot/session flows work repeatedly, agents can continue work over hours or days instead of being one-shot turns.",
+ "reproducibility": "If the same workload/profile produces consistent outcomes and captured artifacts, debugging and optimization are trustworthy rather than anecdotal."
},
- "updatedAt": "2026-04-19T22:55:00Z"
+ "productValueClaim": "ADC only has a real moat if its runtime claims are measured and repeatable. Truthful, repeatable runtime evidence is more valuable than another unverified feature.",
+ "nextMove": "Fix the streamed first-turn OpenCode readiness failure exposed by the session-backed benchmark, finish a bounded redteam rerun, then rerun the startup-stream benchmark and promote the event-timeline artifact into the runtime scorecard.",
+ "updatedAt": "2026-04-22T21:36:00Z"
}
diff --git a/.evolve/progress.md b/.evolve/progress.md
index b4d88ae..fa8eb4f 100644
--- a/.evolve/progress.md
+++ b/.evolve/progress.md
@@ -1,6 +1,59 @@
PR: https://github.com/tangle-network/agent-dev-container/pull/722
+## ADC Runtime Validation Loop — KICKOFF (2026-04-22)
+
+New active pursuit:
+- .evolve/pursuits/2026-04-22-adc-runtime-validation-loop.md
+
+Scope:
+- provisioning speed
+- runtime correctness
+- sandbox reliability
+- security
+- statefulness
+- reproducibility
+
+Intent:
+- stop treating ADC product claims as narrative
+- promote them to measurable gates with commands, artifacts, and repeatable baselines
+- use existing harnesses first, then add only the missing measurement seams
+
+Immediate next move:
+1. baseline eval:benchmark
+2. baseline eval:drivers, eval:direct-api, eval:sdk-dx
+3. baseline eval:lifecycle, eval:resilience, test:sandbox:sdk:snapshot-e2e
+4. baseline eval:redteam
+5. identify missing explicit harnesses for statefulness + reproducibility
+
+## ADC Runtime Validation Loop — GENERATION 1 UPDATE (2026-04-22T21:08Z)
+
+What closed in this tranche:
+- direct API correctness is green on the real local stack: 17/17
+- SDK DX is green on the real local stack: 3/3
+- statefulness is green on the public suspend/resume path:
+  - `direct-api.e2e-stop-resume-exec` passed after moving the probe onto the real persisted workspace root
+  - the underlying runtime issue was a real Docker driver defect: project workspace directories were starting `root:root`, so the agent user could not write into them
+  - fixed in `apps/orchestrator/src/driver/docker.ts` by repairing workspace ownership before sidecar bootstrap
+- snapshot lifecycle no longer lies:
+  - unsupported local Docker snapshot backend now surfaces as explicit `501 SNAPSHOT_SERVICE_UNAVAILABLE`
+  - sandbox API proxy now preserves the upstream status/code instead of flattening it to `500`
+Measured evidence:
+- /tmp/adc-runtime-validation/correctness/direct-api-rerun4/eval-result.json
+- /tmp/adc-runtime-validation/correctness/sdk-dx-rerun4/eval-result.json
+- /tmp/adc-runtime-validation/statefulness/snapshot-scenario-rerun2/eval-result.json
+- /tmp/adc-runtime-validation/statefulness/stop-resume-rerun7/eval-result.json
+
+Verified locally:
+- pnpm --filter @tangle-network/orchestrator exec tsc --noEmit
+- pnpm --filter @tangle-network/sandbox-api exec tsc --noEmit
+
+Remaining gap after this generation:
+1. run the broader reliability packs (eval:lifecycle, eval:resilience, snapshot e2e) on the repaired stack
+2. rerun redteam/pentest against the repaired stack
+3. build the explicit reproducibility harness; this is still the least-measured ADC claim
+
"Fix all pre-existing errors + close Nix loose ends + prevent recurrence"
@@ -270,8 +323,56 @@ Closed 5 of 15 deferred items from gen-1's harden pass. 666-sidecar baseline pre
KEEP — ADVANCE. Round 2 ships 5 high-confidence correctness/observability fixes with regression tests. No regressions. Remaining work is architectural refactors (better as their own focused commits) and ops/e2e validation (gated on real infra access, not code).
+# ADC Runtime Validation Loop — BASELINE 1 (2026-04-22)
+
+- Installed local skills from ~/code/dotfiles into ~/.codex/skills and aligned this repo with a pursue -> evolve loop.
+- Captured first baseline artifacts under /tmp/adc-runtime-validation/.
+- Provisioning speed is currently the strongest validated claim:
+  - benchmark suite passed `6/6`
+  - cold TTFX `2864ms`
+  - warm TTFX `2811ms`
+  - parallel create contention ratio `1.14x`
+- Correctness is not yet validated:
+  - driver matrix `4/8` with firecracker target unreachable at `localhost:5095`
+  - direct API suite has `5` failing scenarios
+  - SDK DX suite failed `2/3`
+- Reliability/statefulness are weaker than expected:
+  - lifecycle suite passed `19/34`
+  - snapshot E2E passed `1/5`
+- Highest-signal failures found so far:
+  - invalid image path reaches `running` instead of failing clearly
+  - SDK file roundtrip writes but reads back empty content
+  - sidecar health path returns `403` through direct API sidecar proxy flow
+  - snapshot list/restore/create-from-snapshot paths fail against storage agent
+  - multiple devcontainer/container-strategy scenarios fail with `spawn opencode EACCES`
+  - firecracker baseline is polluted by missing local orchestrator availability, not runtime execution
Hand off: round 3 of /evolve should pick up the 3 MEDIUM quality items (same pattern). The 2 HIGH refactors deserve their own focused commits. Ops items need drew-gtr-pro time. sdk-provider-acpx is a separate PR.
+## ADC Runtime Validation Loop — GENERATION 1 RELIABILITY/SECURITY UPDATE (2026-04-22T21:36Z)
+
+- Closed the stale local-image eval drift:
+  - `devcontainer.build-cache-node` now uses the real local default image contract and passes again.
+  - artifact: `/tmp/adc-runtime-validation/reliability/devcontainer-build-cache-node-rerun/eval-result.json`
+- Closed the slow, agent-mediated security checks by replacing them with direct runtime probes:
+  - `security.nix-mount-readonly` now verifies the read-only nix mount with raw command execution and passes in `2.1s`
+  - `security.no-container-escape` now probes `/proc/1/environ`, `/host`, and `/var/run/docker.sock` directly and passes in `2.3s`
+  - artifact: `/tmp/adc-runtime-validation/security/security-boundaries-rerun/eval-result.json`
+- Extended the startup-stream benchmark harness to record:
+  - session creation latency
+  - first/second/third/fifth meaningful output-event timings
+  - first tool-invocation timing
+  - inter-event average gap
+- The benchmark work found two real first-turn runtime defects:
+  - fixed: benchmark HOME/XDG path mismatch causing `EACCES` on `~/.local`
+  - still open: session-backed first streamed turn can fail with `OpenCode server is not responding`
+- Targeted verification completed:
+  - `pnpm --filter @tangle-network/orchestrator exec vitest run tests/unit/docker-driver-envs.test.ts`
+  - `pnpm --filter @tangle-network/orchestrator exec tsc --noEmit`
+Current highest-signal remaining gap:
+- the streamed first-turn benchmark path is still not trustworthy enough for showcase numbers because OpenCode readiness can fail after session creation. This is now the next fixation point, not a hidden benchmark artifact.
+
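Reviewer sketch (not part of the diff): one way the stream timings listed in the progress notes above could be derived from raw timestamps. The function name, the summary shape, and the choice to measure output events from the moment the session is ready are all assumptions for illustration; the actual logic in `apps/orchestrator/scripts/benchmark-startup-stream.ts` is not shown in this log.

```typescript
// Hypothetical summary shape; field names are illustrative, not the harness's record schema.
interface StreamTimingSummary {
  sessionCreateMs: number;
  nthEventMs: Record<number, number | null>; // null when fewer events arrived than the rank
  interEventAvgGapMs: number | null;         // null when fewer than two events arrived
}

function summarizeStream(
  requestStartMs: number,
  sessionReadyMs: number,
  eventTimesMs: readonly number[],
  ranks: readonly number[] = [1, 2, 3, 5],
): StreamTimingSummary {
  // Sort a copy so out-of-order arrival does not skew the rank lookup.
  const events = [...eventTimesMs].sort((a, b) => a - b);

  const nthEventMs: Record<number, number | null> = {};
  for (const rank of ranks) {
    const at = events[rank - 1];
    // Assumption: event latency is measured from session readiness.
    nthEventMs[rank] = at === undefined ? null : at - sessionReadyMs;
  }

  // Average gap between consecutive events: (last - first) / (count - 1).
  const first = events[0];
  const last = events[events.length - 1];
  const interEventAvgGapMs =
    first !== undefined && last !== undefined && events.length >= 2
      ? (last - first) / (events.length - 1)
      : null;

  return { sessionCreateMs: sessionReadyMs - requestStartMs, nthEventMs, interEventAvgGapMs };
}
```

For example, `summarizeStream(1000, 1200, [1300, 1400, 1600, 1700, 2100])` gives a session creation time of 200 ms, first/second/third/fifth event offsets of 100, 200, 400, and 900 ms, and an average inter-event gap of 200 ms. Reporting `null` for a missing rank keeps a turn that died after two events distinguishable from a slow one.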
diff --git a/.evolve/pursuits/2026-04-22-adc-runtime-validation-loop.md b/.evolve/pursuits/2026-04-22-adc-runtime-validation-loop.md
new file mode 100644
index 0000000..890b886
--- /dev/null
+++ b/.evolve/pursuits/2026-04-22-adc-runtime-validation-loop.md
@@ -0,0 +1,213 @@
+# Pursuit: ADC Runtime Validation Loop
+Generation: 1
+Date: 2026-04-22
+Status: generation-1-in-progress
+
+## Goal
+
+Validate and improve the six must-win claims for Agent Dev Container as a production agent runtime:
+
+1. Provisioning speed
+2. Runtime correctness
+3. Sandbox reliability
+4. Security
+5. Statefulness
+6. Reproducibility
+
+This pursuit owns the measurement system and the improvement loop. No claim ships without a harness, artifact path, and repeatable command.
+
+## Metric → product-value claim
+
+| Dimension | Product-value claim |
+|---|---|
+| Provisioning speed | If startup/resume latency drops, the user reaches a live agent faster and abandons fewer sessions during setup. |
+| Runtime correctness | If real driver and SDK flows pass, any supported CLI/agent harness can actually run instead of failing after provisioning. |
+| Sandbox reliability | If lifecycle/recovery/snapshot flows survive failures, long-running agent work stops losing state or hanging on infra edges. |
+| Security | If redteam and pentest scenarios pass, hostile code and control-plane misuse are blocked before they become customer incidents. |
+| Statefulness | If stop/resume/snapshot/session flows work repeatedly, agents can continue work over hours or days instead of being one-shot turns. |
+| Reproducibility | If the same workload/profile produces consistent outcomes and captured artifacts, debugging and optimization are trustworthy rather than anecdotal. |
+
+## Existing harness inventory
+
+### Provisioning speed
+- pnpm --dir products/sandbox/evals eval:benchmark
+- pnpm --dir products/sandbox/evals eval:benchmark:staging
+- Evidence:
+  - `.evolve/benchmark-baseline.tsv`
+  - `.evolve/benchmark-final.tsv`
+  - `products/sandbox/evals/scenarios/benchmark-startup.ts`
+### Runtime correctness
+- pnpm test:real-infra:strict
+- pnpm --dir products/sandbox/evals eval:drivers
+- pnpm --dir products/sandbox/evals eval:direct-api
+- pnpm --dir products/sandbox/evals eval:sdk-dx
+- Evidence:
+  - `scripts/real-infra-matrix.sh`
+  - `scripts/lib/orchestrator-driver-commands.sh`
+  - `products/sandbox/evals/scenarios/driver-matrix.ts`
+  - `products/sandbox/evals/scenarios/direct-api.ts`
+  - `products/sandbox/evals/scenarios/sdk-dx.ts`
+### Sandbox reliability
+- pnpm --dir products/sandbox/evals eval:reliability
+- pnpm --dir products/sandbox/evals eval:lifecycle
+- pnpm --dir products/sandbox/evals eval:resilience
+- pnpm test:sandbox:sdk:snapshot-e2e
+- Evidence:
+  - `products/sandbox/evals/scenarios/lifecycle.ts`
+  - `products/sandbox/evals/scenarios/resilience.ts`
+  - `products/sandbox/evals/scripts/ci-harness.ts`
+### Security
+- pnpm --dir products/sandbox/evals eval:redteam
+- pnpm --dir products/sandbox/evals eval:agent:redteam
+- pnpm --dir products/sandbox/evals eval:platform:redteam
+- pnpm --dir products/sandbox/evals eval:pentest
+- Evidence:
+  - `products/sandbox/evals/scenarios/infra-security.ts`
+  - `products/sandbox/evals/scenarios/platform-redteam.ts`
+  - `products/sandbox/evals/scenarios/pentest-*`
+### Statefulness
+- pnpm --dir products/sandbox/evals eval:lifecycle
+- pnpm test:sandbox:sdk:snapshot-e2e
+- existing route/service tests under orchestrator + sandbox API
+- Evidence:
+  - `products/sandbox/evals/scenarios/lifecycle.ts`
+  - `products/sandbox/evals/src/helpers.ts`
+  - `apps/orchestrator/tests/unit/services/checkpoint-service.test.ts`
+  - `apps/orchestrator/tests/unit/services/snapshot-job-queue.test.ts`
+### Reproducibility
+- partial coverage exists through scorecard/convergence/statistics, but no explicit ADC reproducibility gate yet
+- likely starting points:
+  - `pnpm --dir products/sandbox/evals eval:strategy`
+  - `pnpm --dir products/sandbox/evals eval:full`
+  - `products/sandbox/evals/src/stats.ts`
+- Missing:
+  - explicit repeated-run consistency harness for the same workload/profile/runtime capture
+  - explicit artifact comparison contract for environment identity and replayability
+## Baseline commands
+
+Run in this order unless blocked by infra:
+
+1. Provisioning speed
+   - `pnpm --dir products/sandbox/evals eval:benchmark`
+2. Runtime correctness
+   - `pnpm --dir products/sandbox/evals eval:drivers`
+   - `pnpm --dir products/sandbox/evals eval:direct-api`
+   - `pnpm --dir products/sandbox/evals eval:sdk-dx`
+3. Sandbox reliability
+   - `pnpm --dir products/sandbox/evals eval:lifecycle`
+   - `pnpm --dir products/sandbox/evals eval:resilience`
+   - `pnpm test:sandbox:sdk:snapshot-e2e`
+4. Security
+   - `pnpm --dir products/sandbox/evals eval:redteam`
+5. Statefulness
+   - derive from lifecycle + snapshot baseline above
+6. Reproducibility
+   - `pnpm --dir products/sandbox/evals eval:strategy`
+   - then add a dedicated harness if current outputs are insufficient
+## Artifact contract
+
+- Local eval output root: /tmp/adc-runtime-validation/<dimension>/
+- Canonical long-lived repo state:
+  - `.evolve/current.json`
+  - `.evolve/progress.md`
+  - `.evolve/experiments.jsonl`
+  - this pursuit doc
+- Every dimension must record:
+  - command
+  - timestamp
+  - output directory
+  - summary result
+  - next experiment if below target
+## Success thresholds (initial)
+
+These are not final product SLAs. They are the first hard gate for truthful positioning.
+
+| Dimension | Initial gate |
+|---|---|
+| Provisioning speed | benchmark artifacts present, p50/p95 captured, warm/cold split explicit |
+| Runtime correctness | driver/direct-api/sdk-dx eval pack green locally |
+| Sandbox reliability | lifecycle + resilience + snapshot e2e green locally |
+| Security | redteam pack green locally with no new high-severity findings |
+| Statefulness | stop/resume + snapshot restore proven through public surfaces, not unit-only |
+| Reproducibility | repeated-run variance measured and an explicit reproducibility harness/spec exists |
+
+## Immediate gaps
+
+1. Reproducibility is under-measured. Existing stats helpers are not the same as an ADC reproducibility gate.
+2. The six dimensions are not yet reflected in one scorecard or release gate.
+
+## Generation 1 thesis
+
+Before optimizing anything further, make the six ADC product claims mechanically checkable with real commands and artifacts. The first win is a truthful scorecard, not a speculative speedup.
+
+## Generation 1 build/results
+
+### Runtime correctness
+- eval:direct-api is green locally: 17/17
+- eval:sdk-dx is green locally: 3/3
+- real fixes shipped:
+  - direct-mode eval helper now maps SDK/direct payloads to orchestrator `/projects` correctly
+  - sidecar JWT verifier now accepts the Docker short-hostname/container-id mismatch
+  - local `sidecar:local` image was rebuilt so runtime behavior matches source
+  - sandbox SDK error surfaces now preserve actionable upstream details
+### Statefulness
+- direct-api.e2e-stop-resume-exec is green locally with persisted file proof
+- infra.snapshot-lifecycle now passes honestly by skipping unsupported local Docker snapshot backends on explicit 501 SNAPSHOT_SERVICE_UNAVAILABLE
+- real fixes shipped:
+  - orchestrator snapshot routes now return explicit `501` for missing snapshot backend instead of opaque `500`
+  - sandbox API snapshot proxy now preserves upstream snapshot error status/codes
+  - Docker driver now repairs project workspace ownership before sidecar bootstrap, fixing agent write failures in project roots
+### Key artifacts
+- /tmp/adc-runtime-validation/correctness/direct-api-rerun4/eval-result.json
+- /tmp/adc-runtime-validation/correctness/sdk-dx-rerun4/eval-result.json
+- /tmp/adc-runtime-validation/statefulness/snapshot-scenario-rerun2/eval-result.json
+- /tmp/adc-runtime-validation/statefulness/stop-resume-rerun7/eval-result.json
+
+### Verified
+- pnpm --filter @tangle-network/orchestrator exec tsc --noEmit
+- pnpm --filter @tangle-network/sandbox-api exec tsc --noEmit
+
+### Remaining generation-1 work
+1. rerun the broader lifecycle/resilience packs on the repaired stack
+2. rerun redteam/pentest against the repaired stack
+3. add an explicit reproducibility harness and artifact contract
+
+## Generation 1 follow-on results
+
+### Reliability
+- devcontainer.build-cache-node is green locally again after removing stale raw-image assumptions from evals
+- artifact:
-
/tmp/adc-runtime-validation/reliability/devcontainer-build-cache-node-rerun/eval-result.json
+
+### Security
+- direct security-boundary probes are green locally:
+  - `security.nix-mount-readonly`
+  - `security.no-container-escape`
+- these now execute raw commands instead of waiting on an agent prompt, which makes them faster and less ambiguous
+- artifact:
+  - `/tmp/adc-runtime-validation/security/security-boundaries-rerun/eval-result.json`
+
+### Measurement system
+- startup-stream benchmark harness now captures:
+  - `session_create_ms`
+  - first/second/third/fifth meaningful output events
+  - first tool-invocation timing
+  - inter-event average gap
+- this made the benchmark more useful and immediately exposed a real first-turn runtime defect instead of hiding it behind TTFT-only reporting
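One way the startup-stream measurements listed above could be derived from a recorded event list is sketched below. The `StreamEvent` shape, the `meaningful` flag, and the `summarizeStartupStream` helper are assumptions for illustration; the real harness's event type is not shown here, and it may compute the inter-event gap over a different subset of events.

```typescript
// Hypothetical event shape; not taken from the actual harness.
interface StreamEvent {
  atMs: number; // milliseconds since the prompt was sent
  kind: "text" | "tool" | "other";
  meaningful: boolean; // assumed flag: event carries user-visible output
}

interface StartupStreamMetrics {
  firstMeaningfulMs: number | null;
  secondMeaningfulMs: number | null;
  thirdMeaningfulMs: number | null;
  fifthMeaningfulMs: number | null;
  firstToolMs: number | null;
  interEventAvgMs: number | null;
}

function summarizeStartupStream(events: StreamEvent[]): StartupStreamMetrics {
  const sorted = [...events].sort((a, b) => a.atMs - b.atMs);
  const meaningful = sorted.filter((e) => e.meaningful);
  // Timestamp of the n-th meaningful event (1-based), or null if absent.
  const nth = (n: number): number | null => meaningful[n - 1]?.atMs ?? null;
  const firstTool = sorted.find((e) => e.kind === "tool");
  // Average gap between consecutive events: (last - first) / (count - 1).
  // Assumption: the gap is taken over all events, not only meaningful ones.
  const interEventAvgMs =
    sorted.length > 1
      ? (sorted[sorted.length - 1].atMs - sorted[0].atMs) / (sorted.length - 1)
      : null;
  return {
    firstMeaningfulMs: nth(1),
    secondMeaningfulMs: nth(2),
    thirdMeaningfulMs: nth(3),
    fifthMeaningfulMs: nth(5),
    firstToolMs: firstTool ? firstTool.atMs : null,
    interEventAvgMs,
  };
}
```

Reporting later events alongside the first one is what lets a run with a fast first token but a stalled follow-up show up as a problem, which TTFT-only reporting would miss.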
+
+### New defect exposed by the measurement loop
+- session-backed streaming benchmark now fails later and more honestly:
+  - after sandbox ready
+  - after session creation
+  - at first streamed turn with `OpenCode server is not responding`
+- this is the next generation-1 blocker because it affects the exact “first visible output” path we want to showcase and optimize
diff --git a/.orchestrator/sandbox-benchmark-records.jsonl b/.orchestrator/sandbox-benchmark-records.jsonl
index ccb7e06..0cddbc3 100644
--- a/.orchestrator/sandbox-benchmark-records.jsonl
+++ b/.orchestrator/sandbox-benchmark-records.jsonl
@@ -599,3 +599,193 @@
{"run_id":"065afa5d","scenario_id":"benchmark.file-operations","iteration":1,"timestamp":"2026-04-12T02:30:19.965Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"files","difficulty":"standard","target":"staging","pass":true,"error":null,"total_ms":33649,"provision_ms":7382,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["benchmark","files","requires-docker"]}
{"run_id":"065afa5d","scenario_id":"benchmark.sidecar-health-latency","iteration":1,"timestamp":"2026-04-12T02:30:19.965Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"standard","target":"staging","pass":true,"error":null,"total_ms":25622,"provision_ms":6909,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["benchmark","health","requires-docker"]} 
{"run_id":"065afa5d","scenario_id":"benchmark.parallel-create-3","iteration":1,"timestamp":"2026-04-12T02:30:19.966Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"lifecycle","difficulty":"stress","target":"staging","pass":true,"error":null,"total_ms":29962,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["benchmark","parallel","stress","requires-docker"]} 
+{"run_id":"0b1cdaa8","scenario_id":"driver.docker.provision","iteration":1,"timestamp":"2026-04-22T19:58:40.595Z","driver":"docker","environment":null,"model":null,"backend":null,"backend_type":"opencode","backend_profile":null,"backend_profile_shape":"none","language":null,"cache_state":null,"flow":"cold_provision","journey":"sandbox_create","measure":"runtime_ready","container_strategy":"new_container","category":"lifecycle","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":1928,"provision_ms":1554,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["driver-matrix","driver-docker","requires-docker"]} 
+{"run_id":"0b1cdaa8","scenario_id":"driver.docker.session","iteration":1,"timestamp":"2026-04-22T19:58:40.595Z","driver":"docker","environment":null,"model":null,"backend":null,"backend_type":"opencode","backend_profile":null,"backend_profile_shape":"none","language":null,"cache_state":null,"flow":"cold_provision","journey":"sandbox_create","measure":"runtime_ready","container_strategy":"new_container","category":"lifecycle","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":1640,"provision_ms":1314,"session_create_ms":14,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["driver-matrix","driver-docker","requires-docker"]} 
+{"run_id":"0b1cdaa8","scenario_id":"driver.docker.prompt","iteration":1,"timestamp":"2026-04-22T19:58:40.595Z","driver":"docker","environment":null,"model":"glm-4.7","backend":null,"backend_type":"opencode","backend_profile":null,"backend_profile_shape":"none","language":null,"cache_state":null,"flow":"cold_provision","journey":"sandbox_create","measure":"runtime_ready","container_strategy":"new_container","category":"lifecycle","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":1854,"provision_ms":1377,"session_create_ms":89,"first_token_ms":1,"task_complete_ms":17,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":4,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["driver-matrix","driver-docker","requires-docker"]} 
+{"run_id":"0b1cdaa8","scenario_id":"driver.docker.delete","iteration":1,"timestamp":"2026-04-22T19:58:40.595Z","driver":"docker","environment":null,"model":null,"backend":null,"backend_type":"opencode","backend_profile":null,"backend_profile_shape":"none","language":null,"cache_state":null,"flow":"cold_provision","journey":"sandbox_create","measure":"runtime_ready","container_strategy":"new_container","category":"lifecycle","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":1697,"provision_ms":1395,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":294,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["driver-matrix","driver-docker","requires-docker"]} 
+{"run_id":"0b1cdaa8","scenario_id":"driver.firecracker.provision","iteration":1,"timestamp":"2026-04-22T19:58:40.596Z","driver":"firecracker","environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"lifecycle","difficulty":"standard","target":"local","pass":false,"error":null,"total_ms":2,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["driver-matrix","driver-firecracker","requires-docker"]} 
+{"run_id":"0b1cdaa8","scenario_id":"driver.firecracker.session","iteration":1,"timestamp":"2026-04-22T19:58:40.596Z","driver":"firecracker","environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"lifecycle","difficulty":"standard","target":"local","pass":false,"error":null,"total_ms":35,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["driver-matrix","driver-firecracker","requires-docker"]} 
+{"run_id":"0b1cdaa8","scenario_id":"driver.firecracker.prompt","iteration":1,"timestamp":"2026-04-22T19:58:40.596Z","driver":"firecracker","environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"lifecycle","difficulty":"standard","target":"local","pass":false,"error":null,"total_ms":1,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["driver-matrix","driver-firecracker","requires-docker"]} 
+{"run_id":"0b1cdaa8","scenario_id":"driver.firecracker.delete","iteration":1,"timestamp":"2026-04-22T19:58:40.596Z","driver":"firecracker","environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"lifecycle","difficulty":"standard","target":"local","pass":false,"error":null,"total_ms":1,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["driver-matrix","driver-firecracker","requires-docker"]} 
+{"run_id":"9bf25e3d","scenario_id":"benchmark.cold-provision-breakdown","iteration":1,"timestamp":"2026-04-22T19:58:59.334Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":"opencode","backend_profile":null,"backend_profile_shape":"none","language":null,"cache_state":null,"flow":"cold_provision","journey":"sandbox_create","measure":"runtime_ready","container_strategy":"new_container","category":"lifecycle","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":3238,"provision_ms":2,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":1015,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["benchmark","startup","requires-docker"]} 
+{"run_id":"9bf25e3d","scenario_id":"benchmark.warm-provision","iteration":1,"timestamp":"2026-04-22T19:58:59.335Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":"opencode","backend_profile":null,"backend_profile_shape":"none","language":null,"cache_state":null,"flow":"cold_provision","journey":"sandbox_create","measure":"runtime_ready","container_strategy":"new_container","category":"lifecycle","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":3241,"provision_ms":1,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":1015,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["benchmark","startup","requires-docker"]} 
+{"run_id":"9bf25e3d","scenario_id":"benchmark.exec-latency","iteration":1,"timestamp":"2026-04-22T19:58:59.335Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"terminal","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":5526,"provision_ms":1557,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["benchmark","exec","requires-docker"]} 
+{"run_id":"9bf25e3d","scenario_id":"benchmark.file-operations","iteration":1,"timestamp":"2026-04-22T19:58:59.335Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"files","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":5499,"provision_ms":1535,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["benchmark","files","requires-docker"]} 
+{"run_id":"9bf25e3d","scenario_id":"benchmark.sidecar-health-latency","iteration":1,"timestamp":"2026-04-22T19:58:59.335Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":1940,"provision_ms":1546,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["benchmark","health","requires-docker"]} 
+{"run_id":"9bf25e3d","scenario_id":"benchmark.parallel-create-3","iteration":1,"timestamp":"2026-04-22T19:58:59.335Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"lifecycle","difficulty":"stress","target":"local","pass":true,"error":null,"total_ms":3205,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["benchmark","parallel","stress","requires-docker"]} 
+{"run_id":"6bba2006","scenario_id":"sdk-dx.time-to-first-sandbox","iteration":1,"timestamp":"2026-04-22T20:00:10.387Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"sdk","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":1583,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["sdk-dx","requires-docker"]} 
+{"run_id":"6bba2006","scenario_id":"sdk-dx.error-message-quality","iteration":1,"timestamp":"2026-04-22T20:00:10.387Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"sdk","difficulty":"standard","target":"local","pass":false,"error":null,"total_ms":1837,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["sdk-dx"]} 
+{"run_id":"6bba2006","scenario_id":"sdk-dx.file-operations-roundtrip","iteration":1,"timestamp":"2026-04-22T20:00:10.387Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"sdk","difficulty":"smoke","target":"local","pass":false,"error":null,"total_ms":2497,"provision_ms":1388,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["sdk-dx","requires-docker"]} 
+{"run_id":"554ec2be","scenario_id":"lifecycle.create-and-delete","iteration":1,"timestamp":"2026-04-22T20:01:29.581Z","driver":"docker","environment":"universal","model":null,"backend":null,"backend_type":"opencode","backend_profile":null,"backend_profile_shape":"none","language":null,"cache_state":"cold","flow":"cold_provision","journey":"sandbox_create","measure":"runtime_ready","container_strategy":"new_container","category":"lifecycle","difficulty":"smoke","target":"local","pass":true,"error":null,"total_ms":1384,"provision_ms":1384,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":255,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["fast","requires-docker","startup-baseline"]} 
+{"run_id":"554ec2be","scenario_id":"lifecycle.create-first-runtime-probe","iteration":1,"timestamp":"2026-04-22T20:01:29.581Z","driver":"docker","environment":"universal","model":null,"backend":null,"backend_type":"opencode","backend_profile":null,"backend_profile_shape":"none","language":null,"cache_state":"cold","flow":"cold_provision","journey":"sandbox_create","measure":"first_runtime_probe","container_strategy":"new_container","category":"lifecycle","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":1801,"provision_ms":1349,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":452,"delete_ms":298,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["fast","requires-docker","startup-baseline"]} 
+{"run_id":"554ec2be","scenario_id":"lifecycle.stop-and-resume","iteration":1,"timestamp":"2026-04-22T20:01:29.581Z","driver":"docker","environment":"universal","model":null,"backend":null,"backend_type":"opencode","backend_profile":null,"backend_profile_shape":"none","language":null,"cache_state":"warm","flow":"resume_suspended","journey":"sandbox_resume","measure":"runtime_ready","container_strategy":"resume_suspended","category":"lifecycle","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":605,"provision_ms":605,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":353,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["fast","requires-docker","startup-baseline"]} 
+{"run_id":"554ec2be","scenario_id":"lifecycle.list-sandboxes","iteration":1,"timestamp":"2026-04-22T20:01:29.581Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"lifecycle","difficulty":"smoke","target":"local","pass":true,"error":null,"total_ms":2,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["fast"]} 
+{"run_id":"554ec2be","scenario_id":"lifecycle.health-check","iteration":1,"timestamp":"2026-04-22T20:01:29.581Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"lifecycle","difficulty":"smoke","target":"local","pass":true,"error":null,"total_ms":9,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":9,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["fast"]} 
+{"run_id":"554ec2be","scenario_id":"lifecycle.bare-mode","iteration":1,"timestamp":"2026-04-22T20:01:29.581Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"lifecycle","difficulty":"smoke","target":"local","pass":true,"error":null,"total_ms":507,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":true,"assertions":[],"timings_raw":[],"tags":["fast","requires-docker","bare"]} 
+{"run_id":"554ec2be","scenario_id":"lifecycle.provision-breakdown","iteration":1,"timestamp":"2026-04-22T20:01:29.581Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"lifecycle","difficulty":"smoke","target":"local","pass":true,"error":null,"total_ms":1756,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["fast","requires-docker","timing"]} 
+{"run_id":"554ec2be","scenario_id":"lifecycle.sdk-roundtrip","iteration":1,"timestamp":"2026-04-22T20:01:29.581Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"lifecycle","difficulty":"smoke","target":"local","pass":true,"error":null,"total_ms":2427,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["fast","requires-docker","sdk","timing"]} 
+{"run_id":"554ec2be","scenario_id":"devcontainer.provision-universal","iteration":1,"timestamp":"2026-04-22T20:01:29.581Z","driver":null,"environment":"universal","model":null,"backend":null,"backend_type":"opencode","backend_profile":null,"backend_profile_shape":"none","language":null,"cache_state":null,"flow":"cold_provision","journey":"sandbox_create","measure":"runtime_ready","container_strategy":"new_container","category":"lifecycle","difficulty":"stress","target":"local","pass":true,"error":null,"total_ms":1871,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["devcontainer","requires-docker","requires-ghcr","slow"]} 
+{"run_id":"554ec2be","scenario_id":"devcontainer.provision-ethereum","iteration":1,"timestamp":"2026-04-22T20:01:29.581Z","driver":null,"environment":"ethereum","model":null,"backend":null,"backend_type":"opencode","backend_profile":null,"backend_profile_shape":"none","language":null,"cache_state":null,"flow":"cold_provision","journey":"sandbox_create","measure":"runtime_ready","container_strategy":"new_container","category":"lifecycle","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":1724,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["devcontainer","requires-docker","requires-ghcr"]} 
+{"run_id":"554ec2be","scenario_id":"devcontainer.provision-rust","iteration":1,"timestamp":"2026-04-22T20:01:29.581Z","driver":null,"environment":"rust","model":null,"backend":null,"backend_type":"opencode","backend_profile":null,"backend_profile_shape":"none","language":"rust","cache_state":null,"flow":"cold_provision","journey":"sandbox_create","measure":"runtime_ready","container_strategy":"new_container","category":"lifecycle","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":1923,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["devcontainer","requires-docker","requires-ghcr"]} 
+{"run_id":"554ec2be","scenario_id":"devcontainer.build-cache-node","iteration":1,"timestamp":"2026-04-22T20:01:29.581Z","driver":null,"environment":"node","model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":"node","cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"lifecycle","difficulty":"stress","target":"local","pass":false,"error":null,"total_ms":1953,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["devcontainer","requires-docker","requires-llm","slow"]} 
+{"run_id":"554ec2be","scenario_id":"devcontainer.full-lifecycle","iteration":1,"timestamp":"2026-04-22T20:01:29.581Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":"opencode","backend_profile":null,"backend_profile_shape":"none","language":null,"cache_state":null,"flow":"cold_provision","journey":"sandbox_create","measure":"runtime_ready","container_strategy":"new_container","category":"lifecycle","difficulty":"standard","target":"local","pass":false,"error":null,"total_ms":1872,"provision_ms":1453,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":377,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["devcontainer","requires-docker","requires-llm"]} 
+{"run_id":"554ec2be","scenario_id":"driver.docker.provision","iteration":1,"timestamp":"2026-04-22T20:01:29.581Z","driver":"docker","environment":null,"model":null,"backend":null,"backend_type":"opencode","backend_profile":null,"backend_profile_shape":"none","language":null,"cache_state":null,"flow":"cold_provision","journey":"sandbox_create","measure":"runtime_ready","container_strategy":"new_container","category":"lifecycle","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":1667,"provision_ms":1373,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["driver-matrix","driver-docker","requires-docker"]} 
+{"run_id":"554ec2be","scenario_id":"driver.docker.session","iteration":1,"timestamp":"2026-04-22T20:01:29.581Z","driver":"docker","environment":null,"model":null,"backend":null,"backend_type":"opencode","backend_profile":null,"backend_profile_shape":"none","language":null,"cache_state":null,"flow":"cold_provision","journey":"sandbox_create","measure":"runtime_ready","container_strategy":"new_container","category":"lifecycle","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":1839,"provision_ms":1463,"session_create_ms":14,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["driver-matrix","driver-docker","requires-docker"]} 
+{"run_id":"554ec2be","scenario_id":"driver.docker.prompt","iteration":1,"timestamp":"2026-04-22T20:01:29.581Z","driver":"docker","environment":null,"model":"glm-4.7","backend":null,"backend_type":"opencode","backend_profile":null,"backend_profile_shape":"none","language":null,"cache_state":null,"flow":"cold_provision","journey":"sandbox_create","measure":"runtime_ready","container_strategy":"new_container","category":"lifecycle","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":1735,"provision_ms":1338,"session_create_ms":14,"first_token_ms":1,"task_complete_ms":15,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":4,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["driver-matrix","driver-docker","requires-docker"]} 
+{"run_id":"554ec2be","scenario_id":"driver.docker.delete","iteration":1,"timestamp":"2026-04-22T20:01:29.581Z","driver":"docker","environment":null,"model":null,"backend":null,"backend_type":"opencode","backend_profile":null,"backend_profile_shape":"none","language":null,"cache_state":null,"flow":"cold_provision","journey":"sandbox_create","measure":"runtime_ready","container_strategy":"new_container","category":"lifecycle","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":1720,"provision_ms":1363,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":350,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["driver-matrix","driver-docker","requires-docker"]} 
+{"run_id":"554ec2be","scenario_id":"driver.firecracker.provision","iteration":1,"timestamp":"2026-04-22T20:01:29.581Z","driver":"firecracker","environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"lifecycle","difficulty":"standard","target":"local","pass":false,"error":null,"total_ms":2,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["driver-matrix","driver-firecracker","requires-docker"]} 
+{"run_id":"554ec2be","scenario_id":"driver.firecracker.session","iteration":1,"timestamp":"2026-04-22T20:01:29.581Z","driver":"firecracker","environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"lifecycle","difficulty":"standard","target":"local","pass":false,"error":null,"total_ms":28,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["driver-matrix","driver-firecracker","requires-docker"]} 
+{"run_id":"554ec2be","scenario_id":"driver.firecracker.prompt","iteration":1,"timestamp":"2026-04-22T20:01:29.581Z","driver":"firecracker","environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"lifecycle","difficulty":"standard","target":"local","pass":false,"error":null,"total_ms":1,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["driver-matrix","driver-firecracker","requires-docker"]} 
+{"run_id":"554ec2be","scenario_id":"driver.firecracker.delete","iteration":1,"timestamp":"2026-04-22T20:01:29.581Z","driver":"firecracker","environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"lifecycle","difficulty":"standard","target":"local","pass":false,"error":null,"total_ms":0,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["driver-matrix","driver-firecracker","requires-docker"]} 
+{"run_id":"554ec2be","scenario_id":"container-strategy.baseline.forge-oz","iteration":1,"timestamp":"2026-04-22T20:01:29.581Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":"opencode","backend_profile":null,"backend_profile_shape":"none","language":null,"cache_state":null,"flow":"cold_provision","journey":"sandbox_create","measure":"runtime_ready","container_strategy":"new_container","category":"lifecycle","difficulty":"stress","target":"local","pass":false,"error":null,"total_ms":1812,"provision_ms":1341,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":86,"build_ms":null,"verify_ms":null,"delete_ms":384,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["container-strategy","strategy-baseline","workload-forge-oz","requires-docker","requires-llm","slow"]} 
+{"run_id":"554ec2be","scenario_id":"container-strategy.baseline.express-real","iteration":1,"timestamp":"2026-04-22T20:01:29.581Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":"opencode","backend_profile":null,"backend_profile_shape":"none","language":null,"cache_state":null,"flow":"cold_provision","journey":"sandbox_create","measure":"runtime_ready","container_strategy":"new_container","category":"lifecycle","difficulty":"stress","target":"local","pass":false,"error":null,"total_ms":1849,"provision_ms":1425,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":27,"build_ms":null,"verify_ms":null,"delete_ms":397,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["container-strategy","strategy-baseline","workload-express-real","requires-docker","requires-llm","slow"]} 
+{"run_id":"554ec2be","scenario_id":"container-strategy.baseline.datascience","iteration":1,"timestamp":"2026-04-22T20:01:29.581Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":"opencode","backend_profile":null,"backend_profile_shape":"none","language":null,"cache_state":null,"flow":"cold_provision","journey":"sandbox_create","measure":"runtime_ready","container_strategy":"new_container","category":"lifecycle","difficulty":"stress","target":"local","pass":false,"error":null,"total_ms":1741,"provision_ms":1338,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":26,"build_ms":null,"verify_ms":null,"delete_ms":376,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["container-strategy","strategy-baseline","workload-datascience","requires-docker","requires-llm","slow"]} 
+{"run_id":"554ec2be","scenario_id":"container-strategy.baseline.cross-stack","iteration":1,"timestamp":"2026-04-22T20:01:29.581Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":"opencode","backend_profile":null,"backend_profile_shape":"none","language":null,"cache_state":null,"flow":"cold_provision","journey":"sandbox_create","measure":"runtime_ready","container_strategy":"new_container","category":"lifecycle","difficulty":"stress","target":"local","pass":false,"error":null,"total_ms":1775,"provision_ms":1330,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":108,"build_ms":null,"verify_ms":null,"delete_ms":336,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["container-strategy","strategy-baseline","workload-cross-stack","requires-docker","requires-llm","slow"]} 
+{"run_id":"554ec2be","scenario_id":"container-strategy.nix.forge-oz","iteration":1,"timestamp":"2026-04-22T20:01:29.581Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":"opencode","backend_profile":null,"backend_profile_shape":"none","language":null,"cache_state":null,"flow":"cold_provision","journey":"sandbox_create","measure":"runtime_ready","container_strategy":"new_container","category":"lifecycle","difficulty":"stress","target":"local","pass":false,"error":null,"total_ms":1703,"provision_ms":1301,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":25,"build_ms":null,"verify_ms":null,"delete_ms":376,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["container-strategy","strategy-nix","workload-forge-oz","requires-docker","requires-llm","slow"]} 
+{"run_id":"554ec2be","scenario_id":"container-strategy.nix.express-real","iteration":1,"timestamp":"2026-04-22T20:01:29.581Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":"opencode","backend_profile":null,"backend_profile_shape":"none","language":null,"cache_state":null,"flow":"cold_provision","journey":"sandbox_create","measure":"runtime_ready","container_strategy":"new_container","category":"lifecycle","difficulty":"stress","target":"local","pass":false,"error":null,"total_ms":1773,"provision_ms":1353,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":25,"build_ms":null,"verify_ms":null,"delete_ms":394,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["container-strategy","strategy-nix","workload-express-real","requires-docker","requires-llm","slow"]} 
+{"run_id":"554ec2be","scenario_id":"container-strategy.nix.datascience","iteration":1,"timestamp":"2026-04-22T20:01:29.581Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":"opencode","backend_profile":null,"backend_profile_shape":"none","language":null,"cache_state":null,"flow":"cold_provision","journey":"sandbox_create","measure":"runtime_ready","container_strategy":"new_container","category":"lifecycle","difficulty":"stress","target":"local","pass":false,"error":null,"total_ms":1748,"provision_ms":1277,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":101,"build_ms":null,"verify_ms":null,"delete_ms":369,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["container-strategy","strategy-nix","workload-datascience","requires-docker","requires-llm","slow"]} 
+{"run_id":"554ec2be","scenario_id":"container-strategy.nix.cross-stack","iteration":1,"timestamp":"2026-04-22T20:01:29.581Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":"opencode","backend_profile":null,"backend_profile_shape":"none","language":null,"cache_state":null,"flow":"cold_provision","journey":"sandbox_create","measure":"runtime_ready","container_strategy":"new_container","category":"lifecycle","difficulty":"stress","target":"local","pass":false,"error":null,"total_ms":1744,"provision_ms":1347,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":25,"build_ms":null,"verify_ms":null,"delete_ms":371,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["container-strategy","strategy-nix","workload-cross-stack","requires-docker","requires-llm","slow"]} 
+{"run_id":"554ec2be","scenario_id":"container-strategy.pangolin-preview-link","iteration":1,"timestamp":"2026-04-22T20:01:29.581Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":"opencode","backend_profile":null,"backend_profile_shape":"none","language":"go","cache_state":null,"flow":"cold_provision","journey":"sandbox_create","measure":"runtime_ready","container_strategy":"new_container","category":"lifecycle","difficulty":"stress","target":"local","pass":false,"error":null,"total_ms":1702,"provision_ms":1298,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":372,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["container-strategy","pangolin","preview-link","requires-docker","requires-llm","requires-pangolin","slow"]} 
+{"run_id":"554ec2be","scenario_id":"direct-api.e2e-stop-resume-exec","iteration":1,"timestamp":"2026-04-22T20:01:29.582Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":"opencode","backend_profile":null,"backend_profile_shape":"none","language":null,"cache_state":null,"flow":"resume_suspended","journey":"sandbox_resume","measure":"runtime_ready","container_strategy":"resume_suspended","category":"lifecycle","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":3207,"provision_ms":575,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","e2e","requires-docker"]} 
+{"run_id":"554ec2be","scenario_id":"benchmark.cold-provision-breakdown","iteration":1,"timestamp":"2026-04-22T20:01:29.582Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":"opencode","backend_profile":null,"backend_profile_shape":"none","language":null,"cache_state":null,"flow":"cold_provision","journey":"sandbox_create","measure":"runtime_ready","container_strategy":"new_container","category":"lifecycle","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":3384,"provision_ms":1,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":1090,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["benchmark","startup","requires-docker"]} 
+{"run_id":"554ec2be","scenario_id":"benchmark.warm-provision","iteration":1,"timestamp":"2026-04-22T20:01:29.582Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":"opencode","backend_profile":null,"backend_profile_shape":"none","language":null,"cache_state":null,"flow":"cold_provision","journey":"sandbox_create","measure":"runtime_ready","container_strategy":"new_container","category":"lifecycle","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":3376,"provision_ms":0,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":1013,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["benchmark","startup","requires-docker"]} 
+{"run_id":"554ec2be","scenario_id":"benchmark.parallel-create-3","iteration":1,"timestamp":"2026-04-22T20:01:29.582Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"lifecycle","difficulty":"stress","target":"local","pass":true,"error":null,"total_ms":3235,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["benchmark","parallel","stress","requires-docker"]} 
+{"run_id":"e8b2352c","scenario_id":"direct-api.health","iteration":1,"timestamp":"2026-04-22T20:01:46.258Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"smoke","target":"local","pass":true,"error":null,"total_ms":9,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":9,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api"]} 
+{"run_id":"e8b2352c","scenario_id":"direct-api.drivers-list","iteration":1,"timestamp":"2026-04-22T20:01:46.258Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"smoke","target":"local","pass":true,"error":null,"total_ms":1,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api"]} 
+{"run_id":"e8b2352c","scenario_id":"direct-api.backends-list","iteration":1,"timestamp":"2026-04-22T20:01:46.258Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"smoke","target":"local","pass":true,"error":null,"total_ms":1,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api"]} 
+{"run_id":"e8b2352c","scenario_id":"direct-api.project-lifecycle","iteration":1,"timestamp":"2026-04-22T20:01:46.258Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":1572,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","requires-docker"]} 
+{"run_id":"e8b2352c","scenario_id":"direct-api.project-suspend-resume","iteration":1,"timestamp":"2026-04-22T20:01:46.258Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":"opencode","backend_profile":null,"backend_profile_shape":"none","language":null,"cache_state":null,"flow":"resume_suspended","journey":"sandbox_resume","measure":"runtime_ready","container_strategy":"resume_suspended","category":"infrastructure","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":2670,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","requires-docker"]} 
+{"run_id":"e8b2352c","scenario_id":"direct-api.sidecar-proxy","iteration":1,"timestamp":"2026-04-22T20:01:46.258Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"standard","target":"local","pass":false,"error":null,"total_ms":1668,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","requires-docker"]} 
+{"run_id":"e8b2352c","scenario_id":"direct-api.concurrent-projects","iteration":1,"timestamp":"2026-04-22T20:01:46.258Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"stress","target":"local","pass":true,"error":null,"total_ms":1958,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","stress","requires-docker"]} 
+{"run_id":"e8b2352c","scenario_id":"direct-api.volume-lifecycle","iteration":1,"timestamp":"2026-04-22T20:01:46.258Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"standard","target":"local","pass":false,"error":null,"total_ms":1,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api"]} 
+{"run_id":"e8b2352c","scenario_id":"direct-api.e2e-exec-roundtrip","iteration":1,"timestamp":"2026-04-22T20:01:46.258Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":2351,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","e2e","requires-docker"]} 
+{"run_id":"e8b2352c","scenario_id":"direct-api.e2e-file-roundtrip","iteration":1,"timestamp":"2026-04-22T20:01:46.258Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"files","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":2583,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","e2e","requires-docker"]} 
+{"run_id":"e8b2352c","scenario_id":"direct-api.e2e-terminal-multi","iteration":1,"timestamp":"2026-04-22T20:01:46.258Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"terminal","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":3636,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","e2e","requires-docker"]} 
+{"run_id":"e8b2352c","scenario_id":"direct-api.e2e-git-workflow","iteration":1,"timestamp":"2026-04-22T20:01:46.258Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"terminal","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":3950,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","e2e","requires-docker"]} 
+{"run_id":"e8b2352c","scenario_id":"direct-api.e2e-sandbox-info","iteration":1,"timestamp":"2026-04-22T20:01:46.258Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"standard","target":"local","pass":false,"error":null,"total_ms":1817,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","e2e","requires-docker"]} 
+{"run_id":"e8b2352c","scenario_id":"direct-api.e2e-health-after-ops","iteration":1,"timestamp":"2026-04-22T20:01:46.259Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"standard","target":"local","pass":false,"error":null,"total_ms":2102,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","e2e","requires-docker"]} 
+{"run_id":"e8b2352c","scenario_id":"direct-api.e2e-stop-resume-exec","iteration":1,"timestamp":"2026-04-22T20:01:46.259Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":"opencode","backend_profile":null,"backend_profile_shape":"none","language":null,"cache_state":null,"flow":"resume_suspended","journey":"sandbox_resume","measure":"runtime_ready","container_strategy":"resume_suspended","category":"lifecycle","difficulty":"standard","target":"local","pass":false,"error":null,"total_ms":62585,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","e2e","requires-docker"]} 
+{"run_id":"e8b2352c","scenario_id":"direct-api.e2e-pentest-from-api","iteration":1,"timestamp":"2026-04-22T20:01:46.259Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"auth","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":7943,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","e2e","pentest","security","requires-docker"]} 
+{"run_id":"e8b2352c","scenario_id":"direct-api.sidecar-url-available","iteration":1,"timestamp":"2026-04-22T20:01:46.259Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"smoke","target":"local","pass":true,"error":null,"total_ms":6939,"provision_ms":1579,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","critical","sidecar","requires-docker"]} 
+{"run_id":"a57a1edb","scenario_id":"resilience.malformed-request","iteration":1,"timestamp":"2026-04-22T20:22:36.048Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"resilience","difficulty":"smoke","target":"local","pass":true,"error":null,"total_ms":4,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["fast"]} 
+{"run_id":"a57a1edb","scenario_id":"resilience.double-delete","iteration":1,"timestamp":"2026-04-22T20:22:36.048Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"resilience","difficulty":"smoke","target":"local","pass":true,"error":null,"total_ms":1743,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["requires-docker","fast"]} 
+{"run_id":"a57a1edb","scenario_id":"resilience.get-nonexistent","iteration":1,"timestamp":"2026-04-22T20:22:36.048Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"resilience","difficulty":"smoke","target":"local","pass":true,"error":null,"total_ms":1,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["fast"]} 
+{"run_id":"a57a1edb","scenario_id":"security.nix-mount-readonly","iteration":1,"timestamp":"2026-04-22T20:22:36.048Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"resilience","difficulty":"standard","target":"local","pass":false,"error":null,"total_ms":1811,"provision_ms":1439,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":37,"build_ms":null,"verify_ms":null,"delete_ms":334,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["security","requires-docker","requires-llm"]} 
+{"run_id":"a57a1edb","scenario_id":"security.no-container-escape","iteration":1,"timestamp":"2026-04-22T20:22:36.048Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"resilience","difficulty":"stress","target":"local","pass":true,"error":null,"total_ms":1832,"provision_ms":1396,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":28,"build_ms":null,"verify_ms":null,"delete_ms":406,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["security","requires-docker","requires-llm"]} 
+{"run_id":"a57a1edb","scenario_id":"security.cross-tenant-isolation","iteration":1,"timestamp":"2026-04-22T20:22:36.048Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"resilience","difficulty":"stress","target":"local","pass":false,"error":null,"total_ms":3265,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":446,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["security","requires-docker","requires-llm"]} 
+{"run_id":"a57a1edb","scenario_id":"security.resource-limits","iteration":1,"timestamp":"2026-04-22T20:22:36.048Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"resilience","difficulty":"stress","target":"local","pass":true,"error":null,"total_ms":2058,"provision_ms":1666,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":26,"build_ms":null,"verify_ms":null,"delete_ms":365,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["security","requires-docker","requires-llm"]} 
+{"run_id":"a57a1edb","scenario_id":"chaos.agent-recovers-from-bad-command","iteration":1,"timestamp":"2026-04-22T20:22:36.048Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"resilience","difficulty":"stress","target":"local","pass":false,"error":null,"total_ms":1879,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["chaos","requires-docker","requires-llm"]} 
+{"run_id":"a57a1edb","scenario_id":"chaos.rapid-session-cycling","iteration":1,"timestamp":"2026-04-22T20:22:36.048Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"resilience","difficulty":"stress","target":"local","pass":false,"error":null,"total_ms":2051,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["chaos","requires-docker","requires-llm"]} 
+{"run_id":"a57a1edb","scenario_id":"chaos.large-output-handling","iteration":1,"timestamp":"2026-04-22T20:22:36.048Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"resilience","difficulty":"stress","target":"local","pass":false,"error":null,"total_ms":1922,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["chaos","requires-docker","requires-llm"]} 
+{"run_id":"a57a1edb","scenario_id":"chaos.concurrent-prompts-same-sandbox","iteration":1,"timestamp":"2026-04-22T20:22:36.048Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"resilience","difficulty":"chaos","target":"local","pass":false,"error":null,"total_ms":1951,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["chaos","requires-docker","requires-llm"]} 
+{"run_id":"a57a1edb","scenario_id":"chaos.delete-while-running","iteration":1,"timestamp":"2026-04-22T20:22:36.048Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"resilience","difficulty":"chaos","target":"local","pass":true,"error":null,"total_ms":2820,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["chaos","requires-docker"]} 
+{"run_id":"a57a1edb","scenario_id":"soak.stability-1h","iteration":1,"timestamp":"2026-04-22T20:22:36.048Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"resilience","difficulty":"stress","target":"local","pass":false,"error":null,"total_ms":1142062,"provision_ms":1392,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":0,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["soak","long-running","requires-docker","requires-llm"]} 
+{"run_id":"a57a1edb","scenario_id":"auth.oversized-payload","iteration":1,"timestamp":"2026-04-22T20:22:36.048Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"resilience","difficulty":"smoke","target":"local","pass":true,"error":null,"total_ms":7,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["auth","security","red-team"]} 
+{"run_id":"a57a1edb","scenario_id":"infra.rate-limit-detection","iteration":1,"timestamp":"2026-04-22T20:22:36.048Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"resilience","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":73,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["rate-limiting"]} 
+{"run_id":"a57a1edb","scenario_id":"pentest.fork-bomb-contained","iteration":1,"timestamp":"2026-04-22T20:22:36.048Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"resilience","difficulty":"standard","target":"local","pass":false,"error":null,"total_ms":6881,"provision_ms":1761,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":498,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["pentest","security","red-team","dos","requires-docker"]} 
+{"run_id":"a57a1edb","scenario_id":"pentest.memory-bomb-contained","iteration":1,"timestamp":"2026-04-22T20:22:36.048Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"resilience","difficulty":"standard","target":"local","pass":false,"error":null,"total_ms":5523,"provision_ms":1813,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":246,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["pentest","security","red-team","dos","requires-docker"]} 
+{"run_id":"a57a1edb","scenario_id":"pentest.disk-fill-contained","iteration":1,"timestamp":"2026-04-22T20:22:36.048Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"resilience","difficulty":"standard","target":"local","pass":false,"error":null,"total_ms":4699,"provision_ms":1857,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":480,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["pentest","security","red-team","dos","requires-docker"]} 
+{"run_id":"a57a1edb","scenario_id":"pentest.cpu-spin-contained","iteration":1,"timestamp":"2026-04-22T20:22:36.048Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"resilience","difficulty":"standard","target":"local","pass":false,"error":null,"total_ms":9276,"provision_ms":1825,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":370,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["pentest","security","red-team","dos","requires-docker"]} 
+{"run_id":"a57a1edb","scenario_id":"pentest.rapid-create-delete-cycle","iteration":1,"timestamp":"2026-04-22T20:22:36.048Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"resilience","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":8863,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":5,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["pentest","security","red-team","api-abuse"]} 
+{"run_id":"a57a1edb","scenario_id":"pentest.concurrent-race-condition","iteration":1,"timestamp":"2026-04-22T20:22:36.048Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"resilience","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":2189,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":614,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["pentest","security","red-team","api-abuse","requires-docker"]} 
+{"run_id":"a57a1edb","scenario_id":"pentest.websocket-exhaustion","iteration":1,"timestamp":"2026-04-22T20:22:36.048Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"resilience","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":2283,"provision_ms":1861,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":339,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["pentest","security","red-team","dos"]} 
+{"run_id":"a57a1edb","scenario_id":"pentest.malicious-postinstall","iteration":1,"timestamp":"2026-04-22T20:22:36.048Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"resilience","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":33529,"provision_ms":1745,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":244,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["pentest","security","red-team","supply-chain","requires-docker"]} 
+{"run_id":"a57a1edb","scenario_id":"pentest.git-clone-data-theft","iteration":1,"timestamp":"2026-04-22T20:22:36.048Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"resilience","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":6049,"provision_ms":1874,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":384,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["pentest","security","red-team","supply-chain","requires-docker"]} 
+{"run_id":"a57a1edb","scenario_id":"compliance.multi-tenant-isolation-10","iteration":1,"timestamp":"2026-04-22T20:22:36.048Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"resilience","difficulty":"stress","target":"local","pass":false,"error":null,"total_ms":14909,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":10,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["compliance","multi-tenant","security","stress","requires-docker"]} 
+{"run_id":"a57a1edb","scenario_id":"compliance.rate-limit-enforcement","iteration":1,"timestamp":"2026-04-22T20:22:36.048Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"resilience","difficulty":"standard","target":"local","pass":false,"error":null,"total_ms":5066,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["compliance","rate-limit","security"]} 
+{"run_id":"a57a1edb","scenario_id":"adversarial.idle-timeout-enforced","iteration":1,"timestamp":"2026-04-22T20:22:36.049Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"resilience","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":44462,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["adversarial","compute-theft","security","requires-docker"]} 
+{"run_id":"a57a1edb","scenario_id":"adversarial.resource-quota-bypass","iteration":1,"timestamp":"2026-04-22T20:22:36.049Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"resilience","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":4447,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["adversarial","compute-theft","security","requires-docker"]} 
+{"run_id":"a57a1edb","scenario_id":"nation-state.noisy-neighbor-impact","iteration":1,"timestamp":"2026-04-22T20:22:36.049Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"resilience","difficulty":"stress","target":"local","pass":true,"error":null,"total_ms":14510,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["nation-state","noisy-neighbor","stress","requires-docker"]} 
+{"run_id":"3978a938","scenario_id":"sdk-dx.time-to-first-sandbox","iteration":1,"timestamp":"2026-04-22T20:24:32.328Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"sdk","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":1606,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["sdk-dx","requires-docker"]} 
+{"run_id":"3978a938","scenario_id":"sdk-dx.error-message-quality","iteration":1,"timestamp":"2026-04-22T20:24:32.328Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"sdk","difficulty":"standard","target":"local","pass":false,"error":null,"total_ms":1671,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["sdk-dx"]} 
+{"run_id":"3978a938","scenario_id":"sdk-dx.file-operations-roundtrip","iteration":1,"timestamp":"2026-04-22T20:24:32.328Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"sdk","difficulty":"smoke","target":"local","pass":false,"error":null,"total_ms":2578,"provision_ms":1456,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["sdk-dx","requires-docker"]} 
+{"run_id":"bce494f8","scenario_id":"direct-api.health","iteration":1,"timestamp":"2026-04-22T20:25:13.118Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"smoke","target":"local","pass":true,"error":null,"total_ms":10,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":10,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api"]} 
+{"run_id":"bce494f8","scenario_id":"direct-api.drivers-list","iteration":1,"timestamp":"2026-04-22T20:25:13.118Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"smoke","target":"local","pass":true,"error":null,"total_ms":2,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api"]} 
+{"run_id":"bce494f8","scenario_id":"direct-api.backends-list","iteration":1,"timestamp":"2026-04-22T20:25:13.118Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"smoke","target":"local","pass":true,"error":null,"total_ms":1,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api"]} 
+{"run_id":"bce494f8","scenario_id":"direct-api.project-lifecycle","iteration":1,"timestamp":"2026-04-22T20:25:13.118Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":1624,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","requires-docker"]} 
+{"run_id":"bce494f8","scenario_id":"direct-api.project-suspend-resume","iteration":1,"timestamp":"2026-04-22T20:25:13.118Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":"opencode","backend_profile":null,"backend_profile_shape":"none","language":null,"cache_state":null,"flow":"resume_suspended","journey":"sandbox_resume","measure":"runtime_ready","container_strategy":"resume_suspended","category":"infrastructure","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":2570,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","requires-docker"]} 
+{"run_id":"bce494f8","scenario_id":"direct-api.sidecar-proxy","iteration":1,"timestamp":"2026-04-22T20:25:13.118Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"standard","target":"local","pass":false,"error":null,"total_ms":1861,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","requires-docker"]} 
+{"run_id":"bce494f8","scenario_id":"direct-api.concurrent-projects","iteration":1,"timestamp":"2026-04-22T20:25:13.118Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"stress","target":"local","pass":true,"error":null,"total_ms":1850,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","stress","requires-docker"]} 
+{"run_id":"bce494f8","scenario_id":"direct-api.volume-lifecycle","iteration":1,"timestamp":"2026-04-22T20:25:13.118Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"standard","target":"local","pass":false,"error":null,"total_ms":1,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api"]} 
+{"run_id":"bce494f8","scenario_id":"direct-api.e2e-exec-roundtrip","iteration":1,"timestamp":"2026-04-22T20:25:13.118Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":2438,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","e2e","requires-docker"]} 
+{"run_id":"bce494f8","scenario_id":"direct-api.e2e-file-roundtrip","iteration":1,"timestamp":"2026-04-22T20:25:13.118Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"files","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":2551,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","e2e","requires-docker"]} 
+{"run_id":"bce494f8","scenario_id":"direct-api.e2e-terminal-multi","iteration":1,"timestamp":"2026-04-22T20:25:13.119Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"terminal","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":3738,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","e2e","requires-docker"]} 
+{"run_id":"bce494f8","scenario_id":"direct-api.e2e-git-workflow","iteration":1,"timestamp":"2026-04-22T20:25:13.119Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"terminal","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":4048,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","e2e","requires-docker"]} 
+{"run_id":"bce494f8","scenario_id":"direct-api.e2e-sandbox-info","iteration":1,"timestamp":"2026-04-22T20:25:13.119Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"standard","target":"local","pass":false,"error":null,"total_ms":1843,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","e2e","requires-docker"]} 
+{"run_id":"bce494f8","scenario_id":"direct-api.e2e-health-after-ops","iteration":1,"timestamp":"2026-04-22T20:25:13.119Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"standard","target":"local","pass":false,"error":null,"total_ms":2106,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","e2e","requires-docker"]} 
+{"run_id":"bce494f8","scenario_id":"direct-api.e2e-stop-resume-exec","iteration":1,"timestamp":"2026-04-22T20:25:13.119Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":"opencode","backend_profile":null,"backend_profile_shape":"none","language":null,"cache_state":null,"flow":"resume_suspended","journey":"sandbox_resume","measure":"runtime_ready","container_strategy":"resume_suspended","category":"lifecycle","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":3177,"provision_ms":479,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","e2e","requires-docker"]} 
+{"run_id":"bce494f8","scenario_id":"direct-api.e2e-pentest-from-api","iteration":1,"timestamp":"2026-04-22T20:25:13.119Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"auth","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":7646,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","e2e","pentest","security","requires-docker"]} 
+{"run_id":"bce494f8","scenario_id":"direct-api.sidecar-url-available","iteration":1,"timestamp":"2026-04-22T20:25:13.119Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"smoke","target":"local","pass":true,"error":null,"total_ms":6958,"provision_ms":1571,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","critical","sidecar","requires-docker"]} 
+{"run_id":"40c4e544","scenario_id":"direct-api.health","iteration":1,"timestamp":"2026-04-22T20:33:10.495Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"smoke","target":"local","pass":true,"error":null,"total_ms":9,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":9,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api"]} 
+{"run_id":"40c4e544","scenario_id":"direct-api.drivers-list","iteration":1,"timestamp":"2026-04-22T20:33:10.495Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"smoke","target":"local","pass":true,"error":null,"total_ms":1,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api"]} 
+{"run_id":"40c4e544","scenario_id":"direct-api.backends-list","iteration":1,"timestamp":"2026-04-22T20:33:10.495Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"smoke","target":"local","pass":true,"error":null,"total_ms":1,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api"]} 
+{"run_id":"40c4e544","scenario_id":"direct-api.project-lifecycle","iteration":1,"timestamp":"2026-04-22T20:33:10.495Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":1580,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","requires-docker"]} 
+{"run_id":"40c4e544","scenario_id":"direct-api.project-suspend-resume","iteration":1,"timestamp":"2026-04-22T20:33:10.495Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":"opencode","backend_profile":null,"backend_profile_shape":"none","language":null,"cache_state":null,"flow":"resume_suspended","journey":"sandbox_resume","measure":"runtime_ready","container_strategy":"resume_suspended","category":"infrastructure","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":2358,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","requires-docker"]} 
+{"run_id":"40c4e544","scenario_id":"direct-api.sidecar-proxy","iteration":1,"timestamp":"2026-04-22T20:33:10.495Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"standard","target":"local","pass":false,"error":null,"total_ms":1721,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","requires-docker"]} 
+{"run_id":"40c4e544","scenario_id":"direct-api.concurrent-projects","iteration":1,"timestamp":"2026-04-22T20:33:10.495Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"stress","target":"local","pass":true,"error":null,"total_ms":1820,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","stress","requires-docker"]} 
+{"run_id":"40c4e544","scenario_id":"direct-api.volume-lifecycle","iteration":1,"timestamp":"2026-04-22T20:33:10.495Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":1,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api"]} 
+{"run_id":"40c4e544","scenario_id":"direct-api.e2e-exec-roundtrip","iteration":1,"timestamp":"2026-04-22T20:33:10.495Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":2436,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","e2e","requires-docker"]} 
+{"run_id":"40c4e544","scenario_id":"direct-api.e2e-file-roundtrip","iteration":1,"timestamp":"2026-04-22T20:33:10.495Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"files","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":2515,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","e2e","requires-docker"]} 
+{"run_id":"40c4e544","scenario_id":"direct-api.e2e-terminal-multi","iteration":1,"timestamp":"2026-04-22T20:33:10.495Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"terminal","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":3603,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","e2e","requires-docker"]} 
+{"run_id":"40c4e544","scenario_id":"direct-api.e2e-git-workflow","iteration":1,"timestamp":"2026-04-22T20:33:10.495Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"terminal","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":4031,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","e2e","requires-docker"]} 
+{"run_id":"40c4e544","scenario_id":"direct-api.e2e-sandbox-info","iteration":1,"timestamp":"2026-04-22T20:33:10.496Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":1880,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","e2e","requires-docker"]} 
+{"run_id":"40c4e544","scenario_id":"direct-api.e2e-health-after-ops","iteration":1,"timestamp":"2026-04-22T20:33:10.496Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"standard","target":"local","pass":false,"error":null,"total_ms":2135,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","e2e","requires-docker"]} 
+{"run_id":"40c4e544","scenario_id":"direct-api.e2e-stop-resume-exec","iteration":1,"timestamp":"2026-04-22T20:33:10.496Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":"opencode","backend_profile":null,"backend_profile_shape":"none","language":null,"cache_state":null,"flow":"resume_suspended","journey":"sandbox_resume","measure":"runtime_ready","container_strategy":"resume_suspended","category":"lifecycle","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":3185,"provision_ms":541,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","e2e","requires-docker"]} 
+{"run_id":"40c4e544","scenario_id":"direct-api.e2e-pentest-from-api","iteration":1,"timestamp":"2026-04-22T20:33:10.496Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"auth","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":7660,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","e2e","pentest","security","requires-docker"]} 
+{"run_id":"40c4e544","scenario_id":"direct-api.sidecar-url-available","iteration":1,"timestamp":"2026-04-22T20:33:10.496Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"smoke","target":"local","pass":true,"error":null,"total_ms":6782,"provision_ms":1411,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","critical","sidecar","requires-docker"]} 
+{"run_id":"0478a80c","scenario_id":"direct-api.health","iteration":1,"timestamp":"2026-04-22T20:36:31.377Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"smoke","target":"local","pass":true,"error":null,"total_ms":9,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":9,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api"]} 
+{"run_id":"0478a80c","scenario_id":"direct-api.drivers-list","iteration":1,"timestamp":"2026-04-22T20:36:31.377Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"smoke","target":"local","pass":true,"error":null,"total_ms":1,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api"]} 
+{"run_id":"0478a80c","scenario_id":"direct-api.backends-list","iteration":1,"timestamp":"2026-04-22T20:36:31.377Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"smoke","target":"local","pass":true,"error":null,"total_ms":1,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api"]} 
+{"run_id":"0478a80c","scenario_id":"direct-api.project-lifecycle","iteration":1,"timestamp":"2026-04-22T20:36:31.377Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":1837,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","requires-docker"]} 
+{"run_id":"0478a80c","scenario_id":"direct-api.project-suspend-resume","iteration":1,"timestamp":"2026-04-22T20:36:31.377Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":"opencode","backend_profile":null,"backend_profile_shape":"none","language":null,"cache_state":null,"flow":"resume_suspended","journey":"sandbox_resume","measure":"runtime_ready","container_strategy":"resume_suspended","category":"infrastructure","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":2423,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","requires-docker"]} 
+{"run_id":"0478a80c","scenario_id":"direct-api.sidecar-proxy","iteration":1,"timestamp":"2026-04-22T20:36:31.377Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"standard","target":"local","pass":false,"error":null,"total_ms":1793,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","requires-docker"]} 
+{"run_id":"0478a80c","scenario_id":"direct-api.concurrent-projects","iteration":1,"timestamp":"2026-04-22T20:36:31.377Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"stress","target":"local","pass":true,"error":null,"total_ms":1909,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","stress","requires-docker"]} 
+{"run_id":"0478a80c","scenario_id":"direct-api.volume-lifecycle","iteration":1,"timestamp":"2026-04-22T20:36:31.377Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":1,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api"]} 
+{"run_id":"0478a80c","scenario_id":"direct-api.e2e-exec-roundtrip","iteration":1,"timestamp":"2026-04-22T20:36:31.377Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":2404,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","e2e","requires-docker"]} 
+{"run_id":"0478a80c","scenario_id":"direct-api.e2e-file-roundtrip","iteration":1,"timestamp":"2026-04-22T20:36:31.377Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"files","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":2560,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","e2e","requires-docker"]} 
+{"run_id":"0478a80c","scenario_id":"direct-api.e2e-terminal-multi","iteration":1,"timestamp":"2026-04-22T20:36:31.377Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"terminal","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":3664,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","e2e","requires-docker"]} 
+{"run_id":"0478a80c","scenario_id":"direct-api.e2e-git-workflow","iteration":1,"timestamp":"2026-04-22T20:36:31.377Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"terminal","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":4355,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","e2e","requires-docker"]} 
+{"run_id":"0478a80c","scenario_id":"direct-api.e2e-sandbox-info","iteration":1,"timestamp":"2026-04-22T20:36:31.377Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":1888,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","e2e","requires-docker"]} 
+{"run_id":"0478a80c","scenario_id":"direct-api.e2e-health-after-ops","iteration":1,"timestamp":"2026-04-22T20:36:31.378Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"standard","target":"local","pass":false,"error":null,"total_ms":2207,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","e2e","requires-docker"]} 
+{"run_id":"0478a80c","scenario_id":"direct-api.e2e-stop-resume-exec","iteration":1,"timestamp":"2026-04-22T20:36:31.378Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":"opencode","backend_profile":null,"backend_profile_shape":"none","language":null,"cache_state":null,"flow":"resume_suspended","journey":"sandbox_resume","measure":"runtime_ready","container_strategy":"resume_suspended","category":"lifecycle","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":3266,"provision_ms":488,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","e2e","requires-docker"]} 
+{"run_id":"0478a80c","scenario_id":"direct-api.e2e-pentest-from-api","iteration":1,"timestamp":"2026-04-22T20:36:31.378Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"auth","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":7712,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","e2e","pentest","security","requires-docker"]} 
+{"run_id":"0478a80c","scenario_id":"direct-api.sidecar-url-available","iteration":1,"timestamp":"2026-04-22T20:36:31.378Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"smoke","target":"local","pass":true,"error":null,"total_ms":6904,"provision_ms":1445,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","critical","sidecar","requires-docker"]} 
+{"run_id":"46977730","scenario_id":"sdk-dx.time-to-first-sandbox","iteration":1,"timestamp":"2026-04-22T20:45:57.700Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"sdk","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":1685,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["sdk-dx","requires-docker"]} 
+{"run_id":"46977730","scenario_id":"sdk-dx.error-message-quality","iteration":1,"timestamp":"2026-04-22T20:45:57.700Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"sdk","difficulty":"standard","target":"local","pass":false,"error":null,"total_ms":313,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["sdk-dx"]} 
+{"run_id":"46977730","scenario_id":"sdk-dx.file-operations-roundtrip","iteration":1,"timestamp":"2026-04-22T20:45:57.700Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"sdk","difficulty":"smoke","target":"local","pass":true,"error":null,"total_ms":1612,"provision_ms":1200,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["sdk-dx","requires-docker"]} 
+{"run_id":"b422d07c","scenario_id":"direct-api.health","iteration":1,"timestamp":"2026-04-22T20:46:36.763Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"smoke","target":"local","pass":true,"error":null,"total_ms":9,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":9,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api"]} 
+{"run_id":"b422d07c","scenario_id":"direct-api.drivers-list","iteration":1,"timestamp":"2026-04-22T20:46:36.763Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"smoke","target":"local","pass":true,"error":null,"total_ms":1,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api"]} 
+{"run_id":"b422d07c","scenario_id":"direct-api.backends-list","iteration":1,"timestamp":"2026-04-22T20:46:36.763Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"smoke","target":"local","pass":true,"error":null,"total_ms":1,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api"]} 
+{"run_id":"b422d07c","scenario_id":"direct-api.project-lifecycle","iteration":1,"timestamp":"2026-04-22T20:46:36.763Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":1710,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","requires-docker"]} 
+{"run_id":"b422d07c","scenario_id":"direct-api.project-suspend-resume","iteration":1,"timestamp":"2026-04-22T20:46:36.763Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":"opencode","backend_profile":null,"backend_profile_shape":"none","language":null,"cache_state":null,"flow":"resume_suspended","journey":"sandbox_resume","measure":"runtime_ready","container_strategy":"resume_suspended","category":"infrastructure","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":2697,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","requires-docker"]} 
+{"run_id":"b422d07c","scenario_id":"direct-api.sidecar-proxy","iteration":1,"timestamp":"2026-04-22T20:46:36.763Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":1829,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","requires-docker"]} 
+{"run_id":"b422d07c","scenario_id":"direct-api.concurrent-projects","iteration":1,"timestamp":"2026-04-22T20:46:36.763Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"stress","target":"local","pass":true,"error":null,"total_ms":1744,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","stress","requires-docker"]} 
+{"run_id":"b422d07c","scenario_id":"direct-api.volume-lifecycle","iteration":1,"timestamp":"2026-04-22T20:46:36.763Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":1,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api"]} 
+{"run_id":"b422d07c","scenario_id":"direct-api.e2e-exec-roundtrip","iteration":1,"timestamp":"2026-04-22T20:46:36.763Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":2456,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","e2e","requires-docker"]} 
+{"run_id":"b422d07c","scenario_id":"direct-api.e2e-file-roundtrip","iteration":1,"timestamp":"2026-04-22T20:46:36.763Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"files","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":2602,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","e2e","requires-docker"]} 
+{"run_id":"b422d07c","scenario_id":"direct-api.e2e-terminal-multi","iteration":1,"timestamp":"2026-04-22T20:46:36.763Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"terminal","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":3686,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","e2e","requires-docker"]} 
+{"run_id":"b422d07c","scenario_id":"direct-api.e2e-git-workflow","iteration":1,"timestamp":"2026-04-22T20:46:36.763Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"terminal","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":4047,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","e2e","requires-docker"]} 
+{"run_id":"b422d07c","scenario_id":"direct-api.e2e-sandbox-info","iteration":1,"timestamp":"2026-04-22T20:46:36.763Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":1763,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","e2e","requires-docker"]} 
+{"run_id":"b422d07c","scenario_id":"direct-api.e2e-health-after-ops","iteration":1,"timestamp":"2026-04-22T20:46:36.763Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":2202,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","e2e","requires-docker"]} 
+{"run_id":"b422d07c","scenario_id":"direct-api.e2e-stop-resume-exec","iteration":1,"timestamp":"2026-04-22T20:46:36.763Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":"opencode","backend_profile":null,"backend_profile_shape":"none","language":null,"cache_state":null,"flow":"resume_suspended","journey":"sandbox_resume","measure":"runtime_ready","container_strategy":"resume_suspended","category":"lifecycle","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":3336,"provision_ms":504,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","e2e","requires-docker"]} 
+{"run_id":"b422d07c","scenario_id":"direct-api.e2e-pentest-from-api","iteration":1,"timestamp":"2026-04-22T20:46:36.763Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"auth","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":7708,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","e2e","pentest","security","requires-docker"]} 
+{"run_id":"b422d07c","scenario_id":"direct-api.sidecar-url-available","iteration":1,"timestamp":"2026-04-22T20:46:36.763Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"smoke","target":"local","pass":true,"error":null,"total_ms":6917,"provision_ms":1542,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","critical","sidecar","requires-docker"]} 
+{"run_id":"5412fa26","scenario_id":"sdk-dx.time-to-first-sandbox","iteration":1,"timestamp":"2026-04-22T20:47:08.054Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"sdk","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":1678,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["sdk-dx","requires-docker"]} 
+{"run_id":"5412fa26","scenario_id":"sdk-dx.error-message-quality","iteration":1,"timestamp":"2026-04-22T20:47:08.054Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"sdk","difficulty":"standard","target":"local","pass":false,"error":null,"total_ms":202,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["sdk-dx"]} 
+{"run_id":"5412fa26","scenario_id":"sdk-dx.file-operations-roundtrip","iteration":1,"timestamp":"2026-04-22T20:47:08.054Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"sdk","difficulty":"smoke","target":"local","pass":true,"error":null,"total_ms":1729,"provision_ms":1235,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["sdk-dx","requires-docker"]} 
+{"run_id":"c5760740","scenario_id":"sdk-dx.time-to-first-sandbox","iteration":1,"timestamp":"2026-04-22T20:49:00.435Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"sdk","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":1602,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["sdk-dx","requires-docker"]} 
+{"run_id":"c5760740","scenario_id":"sdk-dx.error-message-quality","iteration":1,"timestamp":"2026-04-22T20:49:00.436Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"sdk","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":174,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["sdk-dx"]} 
+{"run_id":"c5760740","scenario_id":"sdk-dx.file-operations-roundtrip","iteration":1,"timestamp":"2026-04-22T20:49:00.436Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"sdk","difficulty":"smoke","target":"local","pass":true,"error":null,"total_ms":1728,"provision_ms":1252,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["sdk-dx","requires-docker"]} 
+{"run_id":"542cc615","scenario_id":"infra.snapshot-lifecycle","iteration":1,"timestamp":"2026-04-22T20:49:26.216Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"standard","target":"local","pass":false,"error":null,"total_ms":1654,"provision_ms":1243,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":408,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["requires-docker","snapshots"]} 
+{"run_id":"b405f00c","scenario_id":"infra.snapshot-lifecycle","iteration":1,"timestamp":"2026-04-22T20:55:17.110Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"standard","target":"local","pass":false,"error":null,"total_ms":1658,"provision_ms":1235,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":419,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["requires-docker","snapshots"]} 
+{"run_id":"dde2b799","scenario_id":"direct-api.e2e-stop-resume-exec","iteration":1,"timestamp":"2026-04-22T20:55:19.052Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":"opencode","backend_profile":null,"backend_profile_shape":"none","language":null,"cache_state":null,"flow":"resume_suspended","journey":"sandbox_resume","measure":"runtime_ready","container_strategy":"resume_suspended","category":"lifecycle","difficulty":"standard","target":"local","pass":false,"error":null,"total_ms":3599,"provision_ms":490,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","e2e","requires-docker"]} 
+{"run_id":"1e45e77e","scenario_id":"infra.snapshot-lifecycle","iteration":1,"timestamp":"2026-04-22T20:59:14.384Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"infrastructure","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":1664,"provision_ms":1235,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":425,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["requires-docker","snapshots"]} 
+{"run_id":"50463a8e","scenario_id":"direct-api.e2e-stop-resume-exec","iteration":1,"timestamp":"2026-04-22T20:59:16.255Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":"opencode","backend_profile":null,"backend_profile_shape":"none","language":null,"cache_state":null,"flow":"resume_suspended","journey":"sandbox_resume","measure":"runtime_ready","container_strategy":"resume_suspended","category":"lifecycle","difficulty":"standard","target":"local","pass":false,"error":null,"total_ms":3535,"provision_ms":499,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","e2e","requires-docker"]} 
+{"run_id":"5d0d51df","scenario_id":"direct-api.e2e-stop-resume-exec","iteration":1,"timestamp":"2026-04-22T20:59:56.471Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":"opencode","backend_profile":null,"backend_profile_shape":"none","language":null,"cache_state":null,"flow":"resume_suspended","journey":"sandbox_resume","measure":"runtime_ready","container_strategy":"resume_suspended","category":"lifecycle","difficulty":"standard","target":"local","pass":false,"error":null,"total_ms":3546,"provision_ms":532,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","e2e","requires-docker"]} 
+{"run_id":"5b08d53d","scenario_id":"direct-api.e2e-stop-resume-exec","iteration":1,"timestamp":"2026-04-22T21:04:01.033Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":"opencode","backend_profile":null,"backend_profile_shape":"none","language":null,"cache_state":null,"flow":"resume_suspended","journey":"sandbox_resume","measure":"runtime_ready","container_strategy":"resume_suspended","category":"lifecycle","difficulty":"standard","target":"local","pass":false,"error":null,"total_ms":192,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","e2e","requires-docker"]} 
+{"run_id":"80826e6a","scenario_id":"direct-api.e2e-stop-resume-exec","iteration":1,"timestamp":"2026-04-22T21:06:29.587Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":"opencode","backend_profile":null,"backend_profile_shape":"none","language":null,"cache_state":null,"flow":"resume_suspended","journey":"sandbox_resume","measure":"runtime_ready","container_strategy":"resume_suspended","category":"lifecycle","difficulty":"standard","target":"local","pass":false,"error":null,"total_ms":3855,"provision_ms":511,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","e2e","requires-docker"]} 
+{"run_id":"9da8f5c3","scenario_id":"direct-api.e2e-stop-resume-exec","iteration":1,"timestamp":"2026-04-22T21:07:56.502Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":"opencode","backend_profile":null,"backend_profile_shape":"none","language":null,"cache_state":null,"flow":"resume_suspended","journey":"sandbox_resume","measure":"runtime_ready","container_strategy":"resume_suspended","category":"lifecycle","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":3635,"provision_ms":490,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["direct-api","e2e","requires-docker"]} 
+{"run_id":"1ea76111","scenario_id":"devcontainer.build-cache-node","iteration":1,"timestamp":"2026-04-22T21:21:37.440Z","driver":null,"environment":"node","model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":"node","cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"lifecycle","difficulty":"stress","target":"local","pass":false,"error":null,"total_ms":30,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":null,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["devcontainer","requires-docker","requires-llm","slow"]} 
+{"run_id":"50540808","scenario_id":"security.nix-mount-readonly","iteration":1,"timestamp":"2026-04-22T21:24:54.950Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"resilience","difficulty":"standard","target":"local","pass":true,"error":null,"total_ms":2087,"provision_ms":1210,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":482,"build_ms":null,"verify_ms":null,"delete_ms":393,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["security","requires-docker","requires-llm"]} 
+{"run_id":"50540808","scenario_id":"security.no-container-escape","iteration":1,"timestamp":"2026-04-22T21:24:54.950Z","driver":null,"environment":null,"model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":null,"cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"resilience","difficulty":"stress","target":"local","pass":true,"error":null,"total_ms":2265,"provision_ms":1355,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":479,"build_ms":null,"verify_ms":null,"delete_ms":430,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["security","requires-docker","requires-llm"]} 
+{"run_id":"9a94201a","scenario_id":"devcontainer.build-cache-node","iteration":1,"timestamp":"2026-04-22T21:26:28.649Z","driver":null,"environment":"node","model":null,"backend":null,"backend_type":null,"backend_profile":null,"backend_profile_shape":null,"language":"node","cache_state":null,"flow":null,"journey":null,"measure":null,"container_strategy":null,"category":"lifecycle","difficulty":"stress","target":"local","pass":true,"error":null,"total_ms":56611,"provision_ms":null,"session_create_ms":null,"first_token_ms":null,"task_complete_ms":null,"build_ms":56193,"verify_ms":null,"delete_ms":null,"sse_connect_ms":null,"event_count":null,"token_count":null,"file_count":null,"sandbox_count":null,"input_tokens":null,"output_tokens":null,"cost_usd":null,"cost_source":null,"sandbox_margin_usd":null,"tool_calls_total":null,"tool_calls_success":null,"tool_calls_failed":null,"build_exit_code":null,"build_errors":[],"agent_turns":null,"agent_status":null,"docker_pull_ms":null,"docker_create_ms":null,"docker_start_ms":null,"sidecar_boot_ms":null,"health_check_ms":null,"peak_memory_mb":null,"avg_cpu_percent":null,"cache_hit_rate":null,"install_cold_ms":null,"install_warm_ms":null,"tokens_per_second":null,"inter_token_avg_ms":null,"stream_interruptions":null,"error_code":null,"error_phase":null,"error_driver":null,"bare":false,"assertions":[],"timings_raw":[],"tags":["devcontainer","requires-docker","requires-llm","slow"]} diff --git a/apps/host-agent/src/config.ts b/apps/host-agent/src/config.ts index d4ba8ab..0a009b8 100644 --- a/apps/host-agent/src/config.ts +++ b/apps/host-agent/src/config.ts @@ -169,13 +169,15 @@ export interface HostAgentConfig { tls: TlsConfig;
/**
-   * Per-sandbox egress proxy. When enabled the host-agent creates an
-   * iron-proxy container on each sandbox's bridge network, registers
-   * per-sandbox resolved secrets handed over from the orchestrator,
-   * and injects DNS/CA/HTTP_PROXY env vars into the sandbox before
-   * start. Gated by `HOST_AGENT_EGRESS_ENABLED` (default: off) so
-   * existing deployments are not impacted until the iron-proxy image
-   * is pushed to the host.
+   * Egress proxy support for the host-agent Docker runtime path. When
+   * enabled the host-agent creates a per-sandbox iron-proxy container on
+   * the sandbox bridge network, registers the resolved secrets handed
+   * over from the orchestrator, and injects DNS/CA/proxy env vars into
+   * the sandbox before start. Firecracker uses a separate per-host path.
+   *
+   * Gated by `HOST_AGENT_EGRESS_ENABLED` (default: off) so existing
+   * deployments are not impacted until the iron-proxy image is present
+   * on the host.
    */
   egress: {
     enabled: boolean;
diff --git a/apps/host-agent/src/routes/runtime.ts b/apps/host-agent/src/routes/runtime.ts
index e52d408..f9f77c4 100644
--- a/apps/host-agent/src/routes/runtime.ts
+++ b/apps/host-agent/src/routes/runtime.ts
@@ -62,9 +62,30 @@ const ROUTE_CACHE_MAX_ENTRIES = Number.parseInt(
   10,
 );
 const DEFAULT_WORKSPACE_ROOT = "/home/agent";
+const EGRESS_PROXY_CERT_CONTAINER_PATH =
+  "/usr/local/share/ca-certificates/egress-proxy.crt";
+const RESERVED_SIDECAR_ENV_PREFIX = "SIDECAR_";
+const RESERVED_RUNTIME_ENV_KEYS = new Set(["CONTAINER_ID", "STORAGE_PATH"]);
 const routeCache = new Map<string, SidecarRouteInfo>();
 const routeCacheInFlight = new Map<string, Promise>();
+function sanitizeExplicitWorkspaceRoot(
+  value: string | undefined,
+): string | undefined {
+  const trimmed = value?.trim();
+  if (!trimmed) return undefined;
+  if (!trimmed.startsWith("/")) return undefined;
+  if (
+    trimmed === "/" ||
+    trimmed.includes("\0") ||
+    trimmed.includes("..") ||
+    /^\/(?:proc|sys|dev|etc|root|tmp)(?:\/|$)/.test(trimmed)
+  ) {
+    return undefined;
+  }
+  return trimmed;
+}
+
 function applyContainerWorkspaceEnvironment(options: {
   env: Record<string, string>;
   sessionId: string;
@@ -72,7 +93,9 @@ function applyContainerWorkspaceEnvironment(options: {
   gitWorkspacePath?: string;
   enableSessionWorkspace: boolean;
 }): void {
-  const explicitWorkspaceRoot = options.env.AGENT_WORKSPACE_ROOT?.trim();
+  const explicitWorkspaceRoot = sanitizeExplicitWorkspaceRoot(
+    options.env.AGENT_WORKSPACE_ROOT,
+  );
   const sessionWorkspaceRoot = `${DEFAULT_WORKSPACE_ROOT}/${options.sessionId}`;
   const workspaceRoot = explicitWorkspaceRoot
     ? explicitWorkspaceRoot
@@ -686,10 +709,25 @@ export function createRuntimeApp(options: {
     }
     const env = { ...body.env };
+    for (const key of Object.keys(env)) {
+      if (
+        key.startsWith(RESERVED_SIDECAR_ENV_PREFIX) ||
+        RESERVED_RUNTIME_ENV_KEYS.has(key)
+      ) {
+        delete env[key];
+      }
+    }
     if (egressProxyIp) {
       env.EGRESS_PROXY_IP = egressProxyIp;
       env.HTTPS_PROXY = `http://${egressProxyIp}:1080`;
       env.HTTP_PROXY = `http://${egressProxyIp}:80`;
+      env.NO_PROXY = "localhost,127.0.0.1";
+      if (egressCaCertPath) {
+        env.SSL_CERT_FILE ??= EGRESS_PROXY_CERT_CONTAINER_PATH;
+        env.GIT_SSL_CAINFO ??= EGRESS_PROXY_CERT_CONTAINER_PATH;
+        env.CURL_CA_BUNDLE ??= EGRESS_PROXY_CERT_CONTAINER_PATH;
+        env.NODE_EXTRA_CA_CERTS ??= EGRESS_PROXY_CERT_CONTAINER_PATH;
+      }
     }
     env.SIDECAR_AUTH_TOKEN = config.sidecarAuthToken;
     // node-pty is installed globally in base images. The sidecar uses
@@ -789,7 +827,7 @@ export function createRuntimeApp(options: {
     // certificate errors. Matches the orchestrator Docker driver path.
     if (egressCaCertPath) {
       binds.push(
-        `${egressCaCertPath}:/usr/local/share/ca-certificates/egress-proxy.crt:ro`,
+        `${egressCaCertPath}:${EGRESS_PROXY_CERT_CONTAINER_PATH}:ro`,
       );
     }
diff --git a/apps/host-agent/src/server.ts b/apps/host-agent/src/server.ts
index 060d02f..2644f49 100644
--- a/apps/host-agent/src/server.ts
+++ b/apps/host-agent/src/server.ts
@@ -153,9 +153,10 @@ async function main() {
     });
   });
-  // Egress manager — creates iron-proxy per-sandbox on container create,
-  // destroys on delete. Gated by HOST_AGENT_EGRESS_ENABLED so existing
-  // deployments without the iron-proxy image available keep working.
+  // Docker-runtime egress manager — creates a per-sandbox iron-proxy on
+  // container create and destroys it on delete. Firecracker uses the
+  // shared host-proxy path instead. Gated by HOST_AGENT_EGRESS_ENABLED
+  // so hosts without the iron-proxy image keep working.
   const egressManager = effectiveConfig.egress.enabled
     ? new EgressManager(docker)
     : undefined;
diff --git a/apps/host-agent/tests/unit/runtime-app.test.ts b/apps/host-agent/tests/unit/runtime-app.test.ts
index c8a864e..9f90dbf 100644
--- a/apps/host-agent/tests/unit/runtime-app.test.ts
+++ b/apps/host-agent/tests/unit/runtime-app.test.ts
@@ -38,6 +38,245 @@ afterEach(() => {
 });
 describe("createRuntimeApp /v1/containers", () => {
+  it("injects egress proxy env, CA trust, DNS, and readonly rootfs when egress is enabled", async () => {
+    const config = createTestConfig();
+    const created: { args?: any } = {};
+    const egressManager = {
+      createProxy: vi.fn(async () => ({
+        containerId: "proxy-1",
+        proxyIp: "172.20.0.5",
+        caCertPath: "/tmp/egress/ca.crt",
+        sessionId: "sess-egress",
+        networkName: "test-net",
+      })),
+    };
+    const docker = {
+      ping: vi.fn(async () => undefined),
+      listNetworks: vi.fn(async () => []),
+      createNetwork: vi.fn(async () => ({ Id: "net-1" })),
+      getImage: mockGetImage(),
+      createContainer: vi.fn(async (args: any) => {
+        created.args = args;
+        return {
+          inspect: vi.fn(async () =>
+            createManagedInspect({
+              id: "container-egress",
+              name: args.name,
+              image: args.Image,
+              labels: args.Labels,
+              env: args.Env,
+              status: "created",
+            }),
+          ),
+        };
+      }),
+    } as any;
+    const app = createRuntimeApp({
+      docker,
+      config,
+      imageCache: createTestImageCache(),
+      egressManager: egressManager as any,
+    });
+    const res = await app.request("/v1/containers", {
+      method: "POST",
+      headers: { "content-type": "application/json" },
+      body: JSON.stringify({
+        sessionId: "sess-egress",
+        image: "node:24-alpine",
+        command: ["sh", "-lc", "echo hi"],
+        env: { OPENCODE_MODEL_API_KEY: "proxy-model-token" },
+        labels: { "agent.project-ref": "project-egress" },
+        resources: { cpu: 1, memory: 256, disk: 1024, pids: 64 },
+        volumes: [],
+        network: "test-net",
+        egress: {
+          enabled: true,
+          resolvedSecrets: {
+            OPENCODE_MODEL_API_KEY: "sk-real-egress-key",
+          },
+        },
+        security: {
+          readOnly: false,
+          noNewPrivileges: true,
+          user: "1000:1000",
+          capabilities: { drop: ["ALL"], add: [] },
+        },
+      }),
+    });
+    expect(res.status).toBe(201);
+    expect(egressManager.createProxy).toHaveBeenCalledWith(
+      "sess-egress",
+      "test-net",
+      expect.objectContaining({ enabled: true }),
+      { OPENCODE_MODEL_API_KEY: "sk-real-egress-key" },
+    );
+    const envMap: Record<string, string> = {};
+    for (const entry of created.args.Env as string[]) {
+      const [key, ...valueParts] = entry.split("=");
+      envMap[key] = valueParts.join("=");
+    }
+    expect(envMap.OPENCODE_MODEL_API_KEY).toBe("proxy-model-token");
+    expect(envMap.EGRESS_PROXY_IP).toBe("172.20.0.5");
+    expect(envMap.HTTP_PROXY).toBe("http://172.20.0.5:80");
+    expect(envMap.HTTPS_PROXY).toBe("http://172.20.0.5:1080");
+    expect(envMap.NO_PROXY).toBe("localhost,127.0.0.1");
+    expect(envMap.SSL_CERT_FILE).toBe(
+      "/usr/local/share/ca-certificates/egress-proxy.crt",
+    );
+    expect(envMap.GIT_SSL_CAINFO).toBe(
+      "/usr/local/share/ca-certificates/egress-proxy.crt",
+    );
+    expect(envMap.CURL_CA_BUNDLE).toBe(
+      "/usr/local/share/ca-certificates/egress-proxy.crt",
+    );
+    expect(envMap.NODE_EXTRA_CA_CERTS).toBe(
+      "/usr/local/share/ca-certificates/egress-proxy.crt",
+    );
+    expect(created.args.HostConfig).toEqual(
+      expect.objectContaining({
+        Dns: ["172.20.0.5"],
+        ReadonlyRootfs: true,
+        Binds: expect.arrayContaining([
+          "/tmp/egress/ca.crt:/usr/local/share/ca-certificates/egress-proxy.crt:ro",
+        ]),
+      }),
+    );
+  });
+
+  it("fails closed when egress proxy creation fails", async () => {
+    const config = createTestConfig();
+    const docker = {
+      ping: vi.fn(async () => undefined),
+      listNetworks: vi.fn(async () => []),
+      createNetwork: vi.fn(async () => ({ Id: "net-1" })),
+      getImage: mockGetImage(),
+      createContainer: vi.fn(),
+    } as any;
+    const egressManager = {
+      createProxy: vi.fn(async () => {
+        throw new Error("proxy unavailable");
+      }),
+    };
+    const app = createRuntimeApp({
+      docker,
+      config,
+      imageCache: createTestImageCache(),
+      egressManager: egressManager as any,
+    });
+    const res = await app.request("/v1/containers", {
+      method: "POST",
+      headers: { "content-type": "application/json" },
+      body: JSON.stringify({
+        sessionId: "sess-egress-fail",
+        image: "node:24-alpine",
+        command: ["sh", "-lc", "echo hi"],
+        env: {},
+        labels: { "agent.project-ref": "project-egress-fail" },
+        resources: { cpu: 1, memory: 256, disk: 1024, pids: 64 },
+        volumes: [],
+        network: "test-net",
+        egress: { enabled: true, resolvedSecrets: {} },
+        security: {
+          readOnly: false,
+          noNewPrivileges: true,
+          user: "1000:1000",
+          capabilities: { drop: ["ALL"], add: [] },
+        },
+      }),
+    });
+    expect(res.status).toBe(500);
+    await expect(res.json()).resolves.toMatchObject({
+      code: "EGRESS_PROXY_FAILED",
+    });
+    expect(docker.createContainer).not.toHaveBeenCalled();
+  });
+
+  it("scrubs caller-controlled reserved runtime env before container launch", async () => {
+    const config = createTestConfig();
+    const created: { args?: any } = {};
+    const docker = {
+      ping: vi.fn(async () => undefined),
+      listNetworks: vi.fn(async () => []),
+      createNetwork: vi.fn(async () => ({ Id: "net-1" })),
+      getImage: mockGetImage(),
+      createContainer: vi.fn(async (args: any) => {
+        created.args = args;
+        return {
+          inspect: vi.fn(async () =>
+            createManagedInspect({
+              id: "container-scrub",
+              name: args.name,
+              image: args.Image,
+              labels: args.Labels,
+              env: args.Env,
+              status: "created",
+            }),
+          ),
+        };
+      }),
+    } as any;
+    const app = createRuntimeApp({
+      docker,
+      config,
+      imageCache: createTestImageCache(),
+    });
+    const res = await app.request("/v1/containers", {
+      method: "POST",
+      headers: { "content-type": "application/json" },
+      body: JSON.stringify({
+        sessionId: "sess-scrub",
+        image: "node:24-alpine",
+        command: ["sh", "-lc", "echo hi"],
+        env: {
+          SIDECAR_AUTH_DISABLED: "true",
+          SIDECAR_DEBUG_ENABLED: "true",
+          AGENT_WORKSPACE_ROOT: "/tmp/attacker-owned",
+          CONTAINER_ID: "attacker-controlled",
+          STORAGE_PATH: "/tmp/attacker-state",
+        },
+        labels: { "agent.project-ref": "project-scrub" },
+        resources: { cpu: 1, memory: 256, disk: 1024, pids: 64 },
+        volumes: [],
+        network: "test-net",
+        security: {
+          readOnly: false,
+          noNewPrivileges: true,
+          user: "1000:1000",
+          capabilities: { drop: ["ALL"], add: [] },
+        },
+      }),
+    });
+    expect(res.status).toBe(201);
+    const envMap: Record<string, string> = {};
+    for (const entry of created.args.Env as string[]) {
+      const [key, ...valueParts] = entry.split("=");
+      envMap[key] = valueParts.join("=");
+    }
+    expect(envMap.SIDECAR_AUTH_DISABLED).toBe(
+      process.env.SIDECAR_AUTH_DISABLED,
+    );
+    expect(envMap.SIDECAR_DEBUG_ENABLED).toBeUndefined();
+    expect(envMap.CONTAINER_ID).toBeUndefined();
+    expect(envMap.AGENT_WORKSPACE_ROOT).not.toBe("/tmp/attacker-owned");
+    expect(envMap.STORAGE_PATH).not.toBe("/tmp/attacker-state");
+  });
+
   it("applies security and resource HostConfig knobs", async () => {
     const config = createTestConfig();
     const created: { args?: any } = {};
diff --git a/apps/orchestrator/scripts/benchmark-startup-stream.ts b/apps/orchestrator/scripts/benchmark-startup-stream.ts
index fd4c222..c89e096 100644
--- a/apps/orchestrator/scripts/benchmark-startup-stream.ts
+++ b/apps/orchestrator/scripts/benchmark-startup-stream.ts
@@ -15,8 +15,18 @@ type BenchmarkSample = {
   recordedAt: string;
   startupResponseMs: number;
   readyMs: number;
+  sessionCreateMs: number;
   streamConnectMs: number;
   firstTokenMs: number;
+  secondOutputMs: number | null;
+  thirdOutputMs: number | null;
+  fifthOutputMs: number | null;
+  firstToolInvocationMs: number | null;
+  firstTextEventMs: number | null;
+  eventCount: number;
+  outputEventTypes: string[];
+  outputEventTimelineMs: number[];
+  interEventAvgMs: number | null;
   streamTotalMs: number;
   hostBootstrap?: HostBootstrapRecord | null;
 };
@@ -107,6 +117,7 @@ type ScenarioResult = {
   stats: {
     startupResponse: BenchmarkStats;
     startupReady: BenchmarkStats;
+    sessionCreate: BenchmarkStats;
     streamConnect: BenchmarkStats;
     streamFirstToken: BenchmarkStats;
     streamTotal: BenchmarkStats;
@@ -206,6 +217,10 @@ type ScenarioContext = {
   hostAgent?: Required;
 };
+function usesContainerWorkspace(driver: string): boolean {
+  return driver === "docker" || isHostAgentDriver(driver) || driver === "tangle";
+}
+
 const DEFAULT_API_SECRET_KEY =
   "0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef";
 const DEFAULT_SIDECAR_AUTH_TOKEN =
@@ -1009,6 +1024,7 @@ async function waitForSidecarHealth(
 async function connectStreamRequest(
   scenario: ScenarioContext,
   sidecarId: string,
+  sessionId: string,
   headers: Record<string, string>,
 ): Promise {
   const target = new URL(
@@ -1032,6 +1048,7 @@ async function connectStreamRequest(
       Accept: "text/event-stream",
     },
     body: JSON.stringify({
+      sessionId,
       message: scenario.message,
       backend: scenario.backend,
     }),
@@ -1059,6 +1076,38 @@ async function connectStreamRequest(
   }
 }
+async function createAgentSession(
+  scenario: ScenarioContext,
+  sidecarId: string,
+  headers: Record<string, string>,
+): Promise<string> {
+  const target = new URL(
+    `/sidecars/${sidecarId}/agents/sessions`,
+    scenario.orchestratorUrl,
+  ).toString();
+  const response = await fetch(target, {
+    method: "POST",
+    headers: {
+      ...headers,
+      "Content-Type": "application/json",
+    },
+    body: JSON.stringify({
+      title: `benchmark-${Date.now()}`,
+      backend: scenario.backend,
+    }),
+  });
+  if (!response.ok) {
+    throw new Error(
+      `agents session create failed: ${response.status} ${await response.text()}`,
+    );
+  }
+  const payload = (await response.json()) as { id?: string };
+  if (!payload.id) {
+    throw new Error("agents session create missing id");
+  }
+  return payload.id;
+}
+
 function hasFirstTokenEvent(eventType: string | null, data: unknown): boolean {
   if (eventType === "token") {
     return true;
@@ -1083,6 +1132,47 @@ function isFirstOutputEvent(eventType: string | null, data: unknown): boolean {
   return eventType === "result";
 }
+function classifyOutputEvent(
+  eventType: string | null,
+  data: unknown,
+): "text" | "tool" | "result" | null {
+  if (hasFirstTokenEvent(eventType, data)) {
+    return "text";
+  }
+  if (eventType === "result") {
+    return "result";
+  }
+  if (eventType !== "raw" || data == null) {
+    return null;
+  }
+  const payload =
+    typeof data === "string"
+      ? data
+      : (() => {
+          try {
+            return JSON.stringify(data);
+          } catch {
+            return "";
+          }
+        })();
+  return /tool[-_ ]?invocation|tool_call|tool-use|tool_use/i.test(payload)
+    ? "tool"
+    : null;
+}
+
+function nthOrNull(values: number[], index: number): number | null {
+  return index < values.length ? values[index] : null;
+}
+
+function meanDelta(values: number[]): number | null {
+  if (values.length < 2) return null;
+  const deltas: number[] = [];
+  for (let i = 1; i < values.length; i++) {
+    deltas.push(values[i] - values[i - 1]);
+  }
+  return deltas.reduce((sum, value) => sum + value, 0) / deltas.length;
+}
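
Reviewer note: a standalone sketch of how the two helpers above behave on a timeline of event offsets. The helper bodies mirror the diff; the `timelineMs` values are made up for illustration. One property that may matter when reading `interEventAvgMs`: the mean of consecutive deltas telescopes to `(last - first) / (n - 1)`, so interior samples do not influence the result.

```typescript
// Helper bodies copied from the diff; sample data below is hypothetical.
function nthOrNull(values: number[], index: number): number | null {
  return index < values.length ? values[index] : null;
}

function meanDelta(values: number[]): number | null {
  if (values.length < 2) return null;
  const deltas: number[] = [];
  for (let i = 1; i < values.length; i++) {
    deltas.push(values[i] - values[i - 1]);
  }
  return deltas.reduce((sum, value) => sum + value, 0) / deltas.length;
}

// Hypothetical offsets (ms since the stream request) for four output events.
const timelineMs: number[] = [100, 150, 400, 500];

const secondOutputMs = nthOrNull(timelineMs, 1); // second event offset
const fifthOutputMs = nthOrNull(timelineMs, 4); // only four events were seen
const interEventAvgMs = meanDelta(timelineMs);

// The same average computed from the endpoints only.
const telescopedAvgMs =
  (timelineMs[timelineMs.length - 1] - timelineMs[0]) / (timelineMs.length - 1);

console.log({ secondOutputMs, fifthOutputMs, interEventAvgMs, telescopedAvgMs });
```

If per-gap jitter is what the benchmark wants to surface, a median or max of the deltas would carry information that this mean cannot.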
async function startOrchestratorServer(
scenario: ScenarioContext,
): Promise<ChildProcess | null> {
@@ -1230,7 +1320,10 @@ async function runOne(
const scenarioTag = slug(scenario.name);
const userId = bench-user-${scenarioTag}-${now}-${iteration};
const sessionId = bench-session-${scenarioTag}-${now}-${iteration};
- const runtimeHome =
/tmp/agent-bench/${scenarioTag}/${sessionId};
- const hostRuntimeHome =
/tmp/agent-bench/${scenarioTag}/${sessionId}; - const runtimeHome = usesContainerWorkspace(scenario.driver)
- ?
/home/agent/${sessionId} - : hostRuntimeHome;
const authHeaders = {
Authorization:
Bearer ${apiSecretKey}, "x-user-id": userId, @@ -1291,7 +1384,7 @@ async function runOne( let streamReader: ReadableStreamDefaultReader | null = null;
try {
- mkdirSync(runtimeHome, { recursive: true });
-
mkdirSync(hostRuntimeHome, { recursive: true });
const startRequestAt = performance.now(); const maxStartAttempts = @@ -1371,10 +1464,19 @@ async function runOne( ); const readyMs = performance.now() - startRequestAt;
-
const sessionCreateStartedAt = performance.now();
-
const agentSessionId = await createAgentSession(
-
scenario, -
sidecarId, -
authHeaders, -
);
-
const sessionCreateMs = performance.now() - sessionCreateStartedAt;
-
const streamRequestAt = performance.now(); const streamResponse = await connectStreamRequest( scenario, sidecarId,
-
agentSessionId, authHeaders,); const streamConnectMs = performance.now() - streamRequestAt; @@ -1393,10 +1495,15 @@ async function runOne( const parser = new SSEChunkParser(); const decoder = new TextDecoder(); let firstTokenMs = Number.NaN;
-
let firstToolInvocationMs = Number.NaN;
-
let firstTextEventMs = Number.NaN; let sawDone = false; let streamError: string | null = null; let rawStream = ""; const observedEventTypes: string[] = [];
-
const outputEventTimelineMs: number[] = [];
-
const textEventTimelineMs: number[] = [];
-
const outputEventTypes: string[] = [];
const streamTimeoutMs = Number.isFinite(DEFAULT_STREAM_TIMEOUT_MS) ? DEFAULT_STREAM_TIMEOUT_MS @@ -1430,6 +1537,21 @@ async function runOne( if (eventType) { observedEventTypes.push(eventType); }
-
const classified = classifyOutputEvent(eventType, event.data); -
if (classified) { -
const eventMs = performance.now() - streamRequestAt; -
outputEventTimelineMs.push(eventMs); -
outputEventTypes.push(`${classified}:${eventType ?? "unknown"}`); -
if (classified === "tool" && Number.isNaN(firstToolInvocationMs)) { -
firstToolInvocationMs = eventMs; -
} -
if (classified === "text" && Number.isNaN(firstTextEventMs)) { -
firstTextEventMs = eventMs; -
} -
if (classified === "text") { -
textEventTimelineMs.push(eventMs); -
} -
} if ( Number.isNaN(firstTokenMs) && isFirstOutputEvent(eventType, event.data)
@@ -1450,6 +1572,21 @@ async function runOne( if (eventType) { observedEventTypes.push(eventType); }
-
const classified = classifyOutputEvent(eventType, event.data); -
if (classified) { -
const eventMs = performance.now() - streamRequestAt; -
outputEventTimelineMs.push(eventMs); -
outputEventTypes.push(`${classified}:${eventType ?? "unknown"}`); -
if (classified === "tool" && Number.isNaN(firstToolInvocationMs)) { -
firstToolInvocationMs = eventMs; -
} -
if (classified === "text" && Number.isNaN(firstTextEventMs)) { -
firstTextEventMs = eventMs; -
} -
if (classified === "text") { -
textEventTimelineMs.push(eventMs); -
} -
} if ( Number.isNaN(firstTokenMs) && isFirstOutputEvent(eventType, event.data)
@@ -1489,11 +1626,25 @@ async function runOne( recordedAt: new Date().toISOString(), startupResponseMs, readyMs,
-
streamConnectMs, -
firstTokenMs, -
streamTotalMs, -
hostBootstrap, - };
-
sessionCreateMs, -
streamConnectMs, -
firstTokenMs, -
secondOutputMs: nthOrNull(outputEventTimelineMs, 1), -
thirdOutputMs: nthOrNull(outputEventTimelineMs, 2), -
fifthOutputMs: nthOrNull(outputEventTimelineMs, 4), -
firstToolInvocationMs: Number.isFinite(firstToolInvocationMs) -
? firstToolInvocationMs -
: null, -
firstTextEventMs: Number.isFinite(firstTextEventMs) -
? firstTextEventMs -
: null, -
eventCount: outputEventTimelineMs.length, -
outputEventTypes, -
outputEventTimelineMs, -
interEventAvgMs: meanDelta(textEventTimelineMs), -
streamTotalMs, -
hostBootstrap, -
} finally { if (streamReader) { await streamReader.cancel().catch(() => null); @@ -1629,9 +1780,9 @@ async function runScenario(scenario: ScenarioContext): Promise { console.log( ` startup=${sample.startupResponseMs.toFixed(1)}ms ready=${sample.readyMs.toFixed( 1,};
-
)}ms stream_connect=${sample.streamConnectMs.toFixed(
-
)}ms session_create=${sample.sessionCreateMs.toFixed(1)}ms stream_connect=${sample.streamConnectMs.toFixed( 1,
-
)}ms first_token=${sample.firstTokenMs.toFixed(1)}ms stream_total=${sample.streamTotalMs.toFixed(1)}ms`,
-
)}ms first_output=${sample.firstTokenMs.toFixed(1)}ms second_output=${sample.secondOutputMs?.toFixed(1) ?? "n/a"}ms third_output=${sample.thirdOutputMs?.toFixed(1) ?? "n/a"}ms fifth_output=${sample.fifthOutputMs?.toFixed(1) ?? "n/a"}ms first_tool=${sample.firstToolInvocationMs?.toFixed(1) ?? "n/a"}ms stream_total=${sample.streamTotalMs.toFixed(1)}ms events=${sample.eventCount}`, );} const measuredDurationMs = performance.now() - measuredStartAt; @@ -1642,8 +1793,33 @@ async function runScenario(scenario: ScenarioContext): Promise {
const startupStats = computeStats(samples.map((s) => s.startupResponseMs)); const readyStats = computeStats(samples.map((s) => s.readyMs));
-
const sessionCreateStats = computeStats(
-
samples.map((s) => s.sessionCreateMs), -
); const streamConnectStats = computeStats(samples.map((s) => s.streamConnectMs)); const firstTokenStats = computeStats(samples.map((s) => s.firstTokenMs));
-
const secondOutputStats = computeStats(
-
samples.flatMap((s) => -
typeof s.secondOutputMs === "number" ? [s.secondOutputMs] : [], -
), -
);
-
const thirdOutputStats = computeStats(
-
samples.flatMap((s) => -
typeof s.thirdOutputMs === "number" ? [s.thirdOutputMs] : [], -
), -
);
-
const fifthOutputStats = computeStats(
-
samples.flatMap((s) => -
typeof s.fifthOutputMs === "number" ? [s.fifthOutputMs] : [], -
), -
);
-
const firstToolStats = computeStats(
-
samples.flatMap((s) => -
typeof s.firstToolInvocationMs === "number" -
? [s.firstToolInvocationMs] -
: [], -
), -
); const streamTotalStats = computeStats(samples.map((s) => s.streamTotalMs)); const hostAgentSnapshot = scenario.hostAgent ? await fetchHostAgentImageLatency(scenario.hostAgent.url) @@ -1655,8 +1831,13 @@ async function runScenario(scenario: ScenarioContext): Promise { console.log(
\nSummary (${scenario.name})); printStats("startup_response", startupStats); printStats("startup_ready", readyStats); -
printStats("session_create", sessionCreateStats); printStats("stream_connect", streamConnectStats);
- printStats("stream_first_token", firstTokenStats);
- printStats("stream_first_output", firstTokenStats);
- if (secondOutputStats.max > 0) printStats("stream_second_output", secondOutputStats);
- if (thirdOutputStats.max > 0) printStats("stream_third_output", thirdOutputStats);
- if (fifthOutputStats.max > 0) printStats("stream_fifth_output", fifthOutputStats);
- if (firstToolStats.max > 0) printStats("stream_first_tool", firstToolStats);
printStats("stream_total", streamTotalStats);
console.log(
throughput ${throughputOpsPerSec.toFixed(3)} ops/sec, @@ -1731,6 +1912,7 @@ async function runScenario(scenario: ScenarioContext): Promise { stats: { startupResponse: startupStats, startupReady: readyStats, -
sessionCreate: sessionCreateStats, streamConnect: streamConnectStats, streamFirstToken: firstTokenStats, streamTotal: streamTotalStats,
diff --git a/apps/orchestrator/src/constants.ts b/apps/orchestrator/src/constants.ts index 23dfa0d..1b92264 100644 --- a/apps/orchestrator/src/constants.ts +++ b/apps/orchestrator/src/constants.ts @@ -13,13 +13,17 @@ export { /**
  * Rate limiting configuration
  */
+const isLocalDevelopment =
+  process.env.NODE_ENV === "development" &&
+  process.env.CI !== "true" &&
+  process.env.ORCHESTRATOR_RELAXED_RATE_LIMITS === "true";
 export const RATE_LIMITS = {
-  MANAGER_WINDOW_MS: 60_000, // 1 minute
-  MANAGER_MAX_REQUESTS: 100, // 100 requests per minute per IP for manager operations
-  AGENTS_WINDOW_MS: 60_000, // 1 minute
-  AGENTS_MAX_REQUESTS: 60,
-  MEMORIES_WINDOW_MS: 60_000, // 1 minute
-  MEMORIES_MAX_REQUESTS: 60, // 60 requests per minute per IP for memory operations
+  MANAGER_WINDOW_MS: 60_000,
+  MANAGER_MAX_REQUESTS: isLocalDevelopment ? 10_000 : 100,
+  AGENTS_WINDOW_MS: 60_000,
+  AGENTS_MAX_REQUESTS: isLocalDevelopment ? 10_000 : 60,
+  MEMORIES_WINDOW_MS: 60_000,
+  MEMORIES_MAX_REQUESTS: isLocalDevelopment ? 10_000 : 60,
 } as const;
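
Reviewer note: the relaxed limits only apply when all three environment conditions hold. The sketch below restates that gate as a pure function so the combinations can be checked in isolation. `isRelaxedRateLimitEnv` and `resolveAgentsMaxRequests` are hypothetical names that do not appear in the diff; the environment variable names and the 10_000 / 60 values are taken from it.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical helper: same three-way condition as `isLocalDevelopment` in the diff,
// but reading from an injected env object instead of process.env.
function isRelaxedRateLimitEnv(env: Env): boolean {
  return (
    env.NODE_ENV === "development" &&
    env.CI !== "true" &&
    env.ORCHESTRATOR_RELAXED_RATE_LIMITS === "true"
  );
}

// Hypothetical helper: mirrors the AGENTS_MAX_REQUESTS ternary from the diff.
function resolveAgentsMaxRequests(env: Env): number {
  return isRelaxedRateLimitEnv(env) ? 10_000 : 60;
}

const localDev: Env = {
  NODE_ENV: "development",
  ORCHESTRATOR_RELAXED_RATE_LIMITS: "true",
};
const ciRun: Env = { ...localDev, CI: "true" };
const production: Env = {
  NODE_ENV: "production",
  ORCHESTRATOR_RELAXED_RATE_LIMITS: "true",
};

console.log(
  resolveAgentsMaxRequests(localDev),
  resolveAgentsMaxRequests(ciRun),
  resolveAgentsMaxRequests(production),
);
```

Because the constant in the diff is evaluated once at module load, changing these variables after import would have no effect on `RATE_LIMITS`.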
/** diff --git a/apps/orchestrator/src/driver/docker.ts b/apps/orchestrator/src/driver/docker.ts index 0b3861c..52a2c16 100644 --- a/apps/orchestrator/src/driver/docker.ts +++ b/apps/orchestrator/src/driver/docker.ts @@ -1,7 +1,9 @@ import { existsSync, mkdirSync, rmSync, writeFileSync } from "node:fs"; import { join, posix as pathPosix } from "node:path"; import {
- collectExecOutput, createLogger,
- type DockerContainer, deriveSubprocessIdentityEnvFromNumericUserSpec, executeStartupScripts, mergeSubprocessIdentityEnv, @@ -32,6 +34,7 @@ import { EgressManager, swapSecretsForProxyTokens, } from "@/egress/index"; +import { resolveContainerWorkspacePath } from "@/orchestrator/workspace-paths"; import { OrphanTracker } from "@/services/orphan-tracker"; import { StorageOrchestrator } from "@/storage"; import { LvmBlockStorageManager } from "@/storage/block-storage-manager"; @@ -78,6 +81,220 @@ import { import { isNixProfileEnabled, resolveNixProfile } from "./toolchain-paths";
const logger = createLogger("docker-driver"); +const RESERVED_SIDECAR_ENV_PREFIX = "SIDECAR_"; +const RESERVED_RUNTIME_ENV_KEYS = new Set([
- "AGENT_WORKSPACE_ROOT",
- "CONTAINER_ID",
- "STORAGE_PATH", +]); +const EGRESS_PROXY_CERT_CONTAINER_PATH =
- "/usr/local/share/ca-certificates/egress-proxy.crt";
+function getContainerWorkspaceOwner(env: NodeJS.ProcessEnv): string {
+  const identityEnv = mergeSubprocessIdentityEnv(
+    pickSubprocessIdentityEnv(env),
+  );
+  if (identityEnv.AGENT_SUBPROCESS_UID && identityEnv.AGENT_SUBPROCESS_GID) {
+    return `${identityEnv.AGENT_SUBPROCESS_UID}:${identityEnv.AGENT_SUBPROCESS_GID}`;
+  }
+  return identityEnv.AGENT_SUBPROCESS_USER ?? "agent";
+}
+function shellQuote(value: string): string {
+  return `'${value.replace(/'/g, `'"'"'`)}'`;
+}
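
Reviewer note: a minimal sketch of the single-quote escaping that `shellQuote` appears to implement. The body below assumes the conventional POSIX idiom (close the quote, emit a double-quoted `'`, reopen the quote); the extracted diff line is garbled, so treat this as a reconstruction and not the verified source.

```typescript
// Assumed reconstruction of the helper from the diff.
function shellQuote(value: string): string {
  // Each embedded ' becomes '"'"' : end quote, literal quote, start quote.
  return `'${value.replace(/'/g, `'"'"'`)}'`;
}

const plain = shellQuote("/home/agent/project");
const withQuote = shellQuote("it's");
const withInjection = shellQuote("x'; rm -rf /; echo '");

console.log(plain);
console.log(withQuote);
console.log(withInjection);
```

Inside single quotes a POSIX shell performs no expansion, so `$`, backticks, and spaces in the value need no further escaping.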
+function parseWorkspaceOwner(
+  owner: string,
+):
+  | { kind: "uid-gid"; uid: string; gid: string }
+  | { kind: "user"; user: string } {
+  const numericMatch = owner.match(/^(\d+):(\d+)$/);
+  if (numericMatch) {
+    return {
+      kind: "uid-gid",
+      uid: numericMatch[1],
+      gid: numericMatch[2],
+    };
+  }
+  return { kind: "user", user: owner };
+}
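
Reviewer note: a standalone sketch of how `parseWorkspaceOwner` classifies its input. The `WorkspaceOwner` alias is introduced here for readability and is not in the diff; the sample inputs are illustrative.

```typescript
// Alias added for this sketch only; the diff spells the union inline.
type WorkspaceOwner =
  | { kind: "uid-gid"; uid: string; gid: string }
  | { kind: "user"; user: string };

function parseWorkspaceOwner(owner: string): WorkspaceOwner {
  const numericMatch = owner.match(/^(\d+):(\d+)$/);
  if (numericMatch) {
    return { kind: "uid-gid", uid: numericMatch[1], gid: numericMatch[2] };
  }
  // Anything that is not strictly "<digits>:<digits>" is treated as a user name.
  return { kind: "user", user: owner };
}

const numericOwner = parseWorkspaceOwner("1000:1000");
const namedOwner = parseWorkspaceOwner("agent");
const nameWithGroup = parseWorkspaceOwner("agent:agent");

console.log(numericOwner, namedOwner, nameWithGroup);
```

A name-based `user:group` spec falls into the `user` branch as a single name, which the later `find ... ! -user` predicate would not understand. Whether that input can occur depends on callers; `getContainerWorkspaceOwner` as shown only produces `uid:gid` or a bare user name.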
+function resolveWorkspaceRootForRepair(inspectInfo: {
+  Config?: { Env?: string[]; Labels?: Record<string, string> };
+}): string {
+  const projectRef = inspectInfo.Config?.Labels?.["agent.project-ref"]?.trim();
+  const sessionId = inspectInfo.Config?.Labels?.["agent.session-id"]?.trim();
+  if (projectRef) {
+    return resolveContainerWorkspacePath(projectRef);
+  }
+  if (sessionId) {
+    return resolveContainerWorkspacePath(sessionId);
+  }
+  const envEntries = inspectInfo.Config?.Env ?? [];
+  for (const entry of envEntries) {
+    const separatorIndex = entry.indexOf("=");
+    if (separatorIndex <= 0) continue;
+    if (entry.slice(0, separatorIndex) !== "AGENT_WORKSPACE_ROOT") continue;
+    const value = entry.slice(separatorIndex + 1).trim();
+    if (
+      value &&
+      (value === WORKSPACE_ROOT || value.startsWith(`${WORKSPACE_ROOT}/`))
+    ) {
+      return value;
+    }
+  }
+  return WORKSPACE_ROOT;
+}
+async function inspectExecWithTimeout(
+  exec: { inspect?: () => Promise<{ ExitCode?: number | null }> },
+  timeoutMs: number,
+  label: string,
+): Promise<{ ExitCode?: number | null }> {
+  if (typeof exec.inspect !== "function") {
+    throw new Error(`${label} exec did not expose inspect()`);
+  }
+  let timer: ReturnType<typeof setTimeout> | undefined;
+  try {
+    return await Promise.race([
+      exec.inspect(),
+      new Promise<never>((_, reject) => {
+        timer = setTimeout(() => {
+          reject(new Error(`${label} inspect timed out after ${timeoutMs}ms`));
+        }, timeoutMs);
+        timer.unref?.();
+      }),
+    ]);
+  } finally {
+    if (timer) {
+      clearTimeout(timer);
+    }
+  }
+}
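
Reviewer note: `inspectExecWithTimeout` uses a `Promise.race` against a timer. The sketch below isolates that pattern in a generic helper so its behavior is easy to reason about. `withTimeout` is a hypothetical name, not something in the diff, and the sketch omits the `unref` call to stay runtime-agnostic.

```typescript
// Hypothetical generic version of the race-against-a-timer pattern from the diff.
async function withTimeout<T>(
  work: Promise<T>,
  timeoutMs: number,
  label: string,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  try {
    return await Promise.race([
      work,
      new Promise<never>((_, reject) => {
        timer = setTimeout(() => {
          reject(new Error(`${label} timed out after ${timeoutMs}ms`));
        }, timeoutMs);
      }),
    ]);
  } finally {
    // Always clear the timer so a fast result does not leave it pending.
    if (timer) {
      clearTimeout(timer);
    }
  }
}

// A promise that never settles stands in for a hung inspect() call.
const hung = new Promise<number>(() => {});

withTimeout(Promise.resolve(0), 50, "fast inspect").then((code) => {
  console.log("fast inspect exit code:", code);
});
withTimeout(hung, 10, "hung inspect").catch((error: Error) => {
  console.log(error.message);
});
```

Losing the race rejects the caller but does not cancel the underlying work. In `waitForExecExitCode` a timed-out `inspect()` also throws out of the polling loop immediately, so a single slow inspect ends the wait early instead of being retried until the deadline.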
+async function waitForExecExitCode(
+  exec: { inspect?: () => Promise<{ ExitCode?: number | null }> },
+  timeoutMs: number,
+  label: string,
+): Promise<number> {
+  const effectiveTimeoutMs = (() => {
+    if (process.env.NODE_ENV !== "test") {
+      return timeoutMs;
+    }
+    const override = Number.parseInt(
+      process.env.ORCHESTRATOR_TEST_EXEC_EXIT_TIMEOUT_MS || "",
+      10,
+    );
+    return Number.isFinite(override) && override > 0 ? override : timeoutMs;
+  })();
+  const deadlineAt = Date.now() + effectiveTimeoutMs;
+  while (Date.now() <= deadlineAt) {
+    const remainingMs = Math.max(1, deadlineAt - Date.now());
+    const inspectTimeoutMs = Math.min(remainingMs, 1000);
+    const execInfo = await inspectExecWithTimeout(
+      exec,
+      inspectTimeoutMs,
+      label,
+    );
+    if (typeof execInfo.ExitCode === "number") {
+      return execInfo.ExitCode;
+    }
+    await new Promise((resolve) => setTimeout(resolve, 100));
+  }
+  throw new Error(`${label} did not exit within ${effectiveTimeoutMs}ms`);
+}
+async function repairContainerWorkspaceOwnership(options: {
+  docker: Docker;
+  container: DockerContainer;
+  workspaceRoot: string;
+  owner: string;
+}): Promise<void> {
+  const ownerSpec = parseWorkspaceOwner(options.owner);
+  const writablePaths = [
+    options.workspaceRoot,
+    pathPosix.join(options.workspaceRoot, ".sidecar"),
+    pathPosix.join(options.workspaceRoot, ".sidecar/state"),
+    pathPosix.join(options.workspaceRoot, ".local"),
+    pathPosix.join(options.workspaceRoot, ".local/share"),
+    pathPosix.join(options.workspaceRoot, ".local/state"),
+    pathPosix.join(options.workspaceRoot, ".cache"),
+    pathPosix.join(options.workspaceRoot, ".config"),
+  ];
+  const ensureDirExec = await options.container.exec({
+    Cmd: ["mkdir", "-p", ...writablePaths],
+    User: "root",
+    AttachStdout: true,
+    AttachStderr: true,
+  });
+  await collectExecOutput(options.docker, ensureDirExec, 5000);
+  const ensureDirExitCode = await waitForExecExitCode(
+    ensureDirExec,
+    5000,
+    `Workspace mkdir for ${options.workspaceRoot}`,
+  );
+  if (ensureDirExitCode !== 0) {
+    throw new Error(
+      `Workspace mkdir failed for ${options.workspaceRoot} (exit ${ensureDirExitCode})`,
+    );
+  }
+  const targetedRepairExec = await options.container.exec({
+    Cmd: ["chown", options.owner, ...writablePaths],
+    User: "root",
+    AttachStdout: true,
+    AttachStderr: true,
+  });
+  const targetedOutput = await collectExecOutput(
+    options.docker,
+    targetedRepairExec,
+    5000,
+  );
+  const targetedRepairExitCode = await waitForExecExitCode(
+    targetedRepairExec,
+    5000,
+    `Workspace ownership bootstrap for ${options.workspaceRoot}`,
+  );
+  if (targetedRepairExitCode !== 0) {
+    throw new Error(
+      `Workspace ownership bootstrap failed for ${options.workspaceRoot} (exit ${targetedRepairExitCode}): ${targetedOutput.trim()}`,
+    );
+  }
+  if (options.workspaceRoot === WORKSPACE_ROOT) {
+    return;
+  }
+  const mismatchPredicate =
+    ownerSpec.kind === "uid-gid"
+      ? `\\( ! -uid ${shellQuote(ownerSpec.uid)} -o ! -gid ${shellQuote(ownerSpec.gid)} \\)`
+      : `! -user ${shellQuote(ownerSpec.user)}`;
+  const recursiveRepairScript = [
+    `workspace=${shellQuote(options.workspaceRoot)}`,
+    `owner=${shellQuote(options.owner)}`,
+    `if find "$workspace" -xdev ${mismatchPredicate} -print -quit | grep -q .; then`,
+    `  find "$workspace" -xdev ${mismatchPredicate} -exec chown "$owner" {} +`,
+    "fi",
+  ].join("\n");
+  const chownExec = await options.container.exec({
+    Cmd: ["sh", "-lc", recursiveRepairScript],
+    User: "root",
+    AttachStdout: true,
+    AttachStderr: true,
+  });
+  const output = await collectExecOutput(options.docker, chownExec, 30000);
+  const chownExitCode = await waitForExecExitCode(
+    chownExec,
+    30000,
+    `Workspace ownership repair for ${options.workspaceRoot}`,
+  );
+  if (chownExitCode !== 0) {
+    throw new Error(
+      `Workspace ownership repair failed for ${options.workspaceRoot} (exit ${chownExitCode}): ${output.trim()}`,
+    );
+  }
+}
export interface DockerDriverConfig { socketPath?: string; @@ -557,6 +774,15 @@ export class DockerDriver implements ContainerDriver { ...swappedEnv, };
-
for (const key of Object.keys(env)) { -
if ( -
key.startsWith(RESERVED_SIDECAR_ENV_PREFIX) || -
RESERVED_RUNTIME_ENV_KEYS.has(key) -
) { -
delete env[key]; -
} -
} -
Object.assign( env, mergeSubprocessIdentityEnv(
@@ -576,6 +802,12 @@ export class DockerDriver implements ContainerDriver { env.OPENCODE_MAX_HEAP_MB = String(opencodeHeapMb); }
-
const trustedWorkspaceRoot = -
projectRef?.trim() && projectRef.trim().length > 0 -
? resolveContainerWorkspacePath(projectRef.trim()) -
: config.sessionId?.trim() && config.sessionId.trim().length > 0 -
? resolveContainerWorkspacePath(config.sessionId.trim()) -
: "/home/agent"; const workspaceMountPath = storageMounts?.workspace.containerPath; if (workspaceMountPath) { if (env.AGENT_WORKSPACE_ROOT !== workspaceMountPath) {
@@ -586,8 +818,14 @@ export class DockerDriver implements ContainerDriver { }); } env.AGENT_WORKSPACE_ROOT = workspaceMountPath;
-
env.WORKSPACE_PATH = workspaceMountPath; } else if (!env.AGENT_WORKSPACE_ROOT) {
-
env.AGENT_WORKSPACE_ROOT = "/home/agent";
-
env.AGENT_WORKSPACE_ROOT = trustedWorkspaceRoot; -
if (!env.WORKSPACE_PATH) { -
env.WORKSPACE_PATH = env.AGENT_WORKSPACE_ROOT; -
} -
} else if (!env.WORKSPACE_PATH) { -
env.WORKSPACE_PATH = env.AGENT_WORKSPACE_ROOT; } const baseWorkspaceRoot = env.AGENT_WORKSPACE_ROOT;
@@ -726,6 +964,12 @@ export class DockerDriver implements ContainerDriver {
env.HTTPS_PROXY = http://${egressProxyIp}:1080;
env.HTTP_PROXY = http://${egressProxyIp}:80;
env.NO_PROXY = "localhost,127.0.0.1";
-
if (egressCaCertPath) { -
env.SSL_CERT_FILE ??= EGRESS_PROXY_CERT_CONTAINER_PATH; -
env.GIT_SSL_CAINFO ??= EGRESS_PROXY_CERT_CONTAINER_PATH; -
env.CURL_CA_BUNDLE ??= EGRESS_PROXY_CERT_CONTAINER_PATH; -
env.NODE_EXTRA_CA_CERTS ??= EGRESS_PROXY_CERT_CONTAINER_PATH; -
} } // Inject the shared sidecar auth token so (a) the sidecar can verify
@@ -791,9 +1035,7 @@ export class DockerDriver implements ContainerDriver { // Mount CA cert from egress proxy into the agent container so it // trusts the proxy's MITM leaf certs for HTTPS inspection. const egressBinds: string[] = egressCaCertPath
-
? [ -
`${egressCaCertPath}:/usr/local/share/ca-certificates/egress-proxy.crt:ro`, -
]
-
? [`${egressCaCertPath}:${EGRESS_PROXY_CERT_CONTAINER_PATH}:ro`] : []; const hostConfig: Docker.HostConfig = {
@@ -1288,20 +1530,38 @@ export class DockerDriver implements ContainerDriver { const typedContainer = container as Parameters< typeof startSidecarServer >[0]["container"];
-
const inspectInfo = await container.inspect(); -
const containerEnv = Object.fromEntries( -
(inspectInfo.Config.Env ?? []).map((entry) => { -
const separatorIndex = entry.indexOf("="); -
return separatorIndex >= 0 -
? [ -
entry.slice(0, separatorIndex), -
entry.slice(separatorIndex + 1), -
] -
: [entry, ""]; -
}), -
); -
const workspaceRoot = resolveWorkspaceRootForRepair(inspectInfo); -
const workspaceOwner = getContainerWorkspaceOwner(containerEnv); // Ownership repair MUST complete before sidecar start. The sidecar // demotes child processes (OpenCode, PTY) to uid 1000 (agent). If // workspace dirs aren't owned by agent when these processes start, // they crash with EACCES on mkdir ~/.local/share.
-
// Cargo config can run in parallel with repair since it writes to -
// a root-owned path (/root/.cargo/config.toml). const preStartTasks: Promise<void>[] = [ repairContainerCacheOwnership({ container: typedContainer }),
-
repairContainerWorkspaceOwnership({ -
docker: this.docker, -
container: typedContainer, -
workspaceRoot, -
owner: workspaceOwner, -
}), ]; -
await Promise.all(preStartTasks); if (cargoRegistry) {
-
preStartTasks.push(this.writeCargoConfig(container, cargoRegistry));
-
await this.writeCargoConfig(container, cargoRegistry); }
-
await Promise.all(preStartTasks); // Egress enforcement: if the proxy is active, run the nftables // init script to lock the container's outbound to proxy-only.
diff --git a/apps/orchestrator/src/egress/ARCHITECTURE.md b/apps/orchestrator/src/egress/ARCHITECTURE.md index 8a0f5b3..b7d2fe0 100644 --- a/apps/orchestrator/src/egress/ARCHITECTURE.md +++ b/apps/orchestrator/src/egress/ARCHITECTURE.md @@ -1,10 +1,15 @@
-## Decision: Per-host, not per-sandbox
+## Target architecture: per-host, current runtime: mixed

-After initial design review (Gen 3), we pivoted from per-sandbox proxies to
-per-host proxies. Both designs are in the codebase; the docker driver's
-per-sandbox implementation is kept as a fallback for dev mode.
+Per-host remains the target production shape, but that rollout is not
+complete. Today the codebase runs a mixed model:
+- orchestrator Docker and host-agent Docker runtime paths can provision
+  per-sandbox proxies
+- Firecracker uses a shared per-host proxy
+
+This document describes the target shape and explicitly calls out the
+remaining gaps so the code and docs do not imply a completed migration.
@@ -47,8 +52,8 @@ Host machine (compute node)
| Driver | Proxy deployment | Who manages |
|---|---|---|
| - | docker (local/dev) |
Per-sandbox container OR shared host process |
| - | host-agent (prod) |
Per-host process, managed by host-agent |
| + | docker (local/dev) |
Per-sandbox container |
| + | host-agent (prod Docker runtime) |
Per-sandbox container |
firecracker (prod) |
Per-host process + TAP bridge routing | Host-agent |
local (unit tests) |
Not applicable | — |
tangle (TEE) |
Not applicable (TEE handles isolation) | — |
| @@ -70,20 +75,19 @@ are loaded lazily per-request: | ||
| Blast radius of a compromised proxy: only keys currently in active | ||
| requests. Not all tenants on the host. |
-## Current implementation status (Gen 3) +## Current implementation status
-- [x] EgressManager (per-sandbox, docker driver) — in apps/orchestrator/src/egress/
-- [x] ContainerConfig.egress field — flows through all drivers
-- [x] nftables enforcement in agent container startup
+- [x] EgressManager (per-sandbox proxy lifecycle) for orchestrator Docker and host-agent Docker runtime paths
+- [x] ContainerConfig.egress wiring through orchestrator, host-agent, and firecracker drivers
+- [x] Secret swap before sandbox launch via swapSecretsForProxyTokens()
+- [x] Firecracker per-host proxy registration via HostProxyManager
- iron-proxy Docker image at
infra/iron-proxy/Dockerfile+ CI workflow -- [ ] Per-host proxy management (host-agent side) — TODO -- [ ] Per-source-IP routing config generation — TODO -- [ ] Hot-reload control API wrapper — TODO +- [ ] One production-default model across every driver +- [ ] Full cross-driver E2E proof for allowlist enforcement, tenant isolation, and CA trust +- [ ] Removal of the remaining dev-only per-sandbox fallback once per-host is the default everywhere
-## Migration path +## Remaining migration work
-1. Ship per-sandbox (current Gen 3) for dev/local docker driver -2. Build per-host impl on host-agent for staging -3. Benchmark: per-sandbox latency vs per-host latency -4. Switch production to per-host via IRON_PROXY_MODE=per-host env flag -5. Keep per-sandbox as fallback for local dev (small N, simple config) +1. Finish one production-default deployment model across Docker and Firecracker +2. Add cross-driver E2E proof for allowlist enforcement, tenant isolation, and CA trust +3. Remove the remaining per-sandbox fallback once the per-host path is the default where intended diff --git a/apps/orchestrator/src/orchestrator/index.ts b/apps/orchestrator/src/orchestrator/index.ts index 8f1e710..f0f80ad 100644 --- a/apps/orchestrator/src/orchestrator/index.ts +++ b/apps/orchestrator/src/orchestrator/index.ts @@ -381,6 +381,16 @@ export default class Orchestrator { isHealthMonitoringEnabled: () => this.#healthCheckInterval !== null, getPangolinLifecycle: () => this.#pangolinLifecycle, getStorageClientRegistry: () => this.#storageClientRegistry,
-
getSnapshotResticPassword: async (projectRef) => { -
if (!this.#snapshotService) { -
throw new Error( -
"Snapshot service not available for restic credential resolution", -
); -
} -
return this.#snapshotService.resolveRequestScopedResticPassword( -
projectRef, -
); -
}, containerPool: options?.containerPool, onSidecarReady: (sidecar) => { this.emit("container:ready", sidecar.id, sidecar);
diff --git a/apps/orchestrator/src/orchestrator/project-manager.ts b/apps/orchestrator/src/orchestrator/project-manager.ts index 0a07827..ed8a8fe 100644 --- a/apps/orchestrator/src/orchestrator/project-manager.ts +++ b/apps/orchestrator/src/orchestrator/project-manager.ts @@ -473,10 +473,17 @@ export class ProjectManager { progress, gitWorkspacePath, );
-
if (snapshotSource && config.storage?.fromSnapshot) { -
containerOptions.fromSnapshot = { -
sourceProjectRef: snapshotSource.projectRef, -
snapshotId: config.storage.fromSnapshot, -
quotaBytes: config.storage.quotaBytes, -
}; -
} const { instance } = await this.deps.sidecarManager.createSidecar(containerOptions);
-
let activeInstance = instance;
-
const activeInstance = instance; benchmarkHostId = activeInstance.metadata.hostId; const createdContainer = await this.deps.driver.getContainer( activeInstance.id,
@@ -495,53 +502,6 @@ export class ProjectManager { if (!snapshotSource) { throw new Error("Snapshot source could not be resolved"); }
-
const labels = createdContainer?.labels ?? {}; -
const hostInfo = extractHostInfoFromLabels(labels); -
if ( -
!hostInfo.hostId || -
!hostInfo.hostAgentUrl || -
!hostInfo.storageMountPoint -
) { -
throw new Error("Project snapshot host information is missing"); -
} -
const suspended = await this.deps.sidecarManager.suspendSidecar( -
activeInstance.id, -
); -
if (!suspended) { -
throw new Error("Failed to stop sidecar before snapshot restore"); -
} -
const restored = await this.deps.snapshotService.restoreFromSnapshot( -
snapshotSource.projectRef, -
config.storage.fromSnapshot, -
hostInfo.hostId, -
hostInfo.hostAgentUrl, -
hostInfo.storageMountPoint, -
hostInfo.storageQuotaBytes ?? config.storage.quotaBytes, -
snapshotSource.customerStorage, -
); -
if (!restored) { -
throw new Error( -
`Snapshot ${config.storage.fromSnapshot} was not found`, -
); -
} -
const resumed = await this.deps.sidecarManager.resumeSidecar( -
activeInstance.id, -
startupStartedAt, -
config.startupId, -
"project_provision", -
); -
if (!resumed) { -
throw new Error("Failed to restart sidecar after snapshot restore"); -
} -
activeInstance = resumed; -
benchmarkHostId = activeInstance.metadata.hostId ?? benchmarkHostId; -
logger.info("Snapshot restored for provisioned project", { projectRef: config.projectRef, sourceProjectRef: snapshotSource.projectRef,
@@ -1358,6 +1318,7 @@ export class ProjectManager { }), ...(mountedWorkspaceRoot && { AGENT_WORKSPACE_ROOT: mountedWorkspaceRoot,
-
WORKSPACE_PATH: mountedWorkspaceRoot, }), }, // Pass git config to the driver for host-agent-side cloning
diff --git a/apps/orchestrator/src/orchestrator/sidecar-manager.ts b/apps/orchestrator/src/orchestrator/sidecar-manager.ts index 84aa954..34984d8 100644 --- a/apps/orchestrator/src/orchestrator/sidecar-manager.ts +++ b/apps/orchestrator/src/orchestrator/sidecar-manager.ts @@ -72,6 +72,7 @@ type Dependencies = { isHealthMonitoringEnabled(): boolean; getPangolinLifecycle(): PangolinLifecycle | null; getStorageClientRegistry(): StorageClientRegistry | null;
-
getSnapshotResticPassword?(projectRef: string): Promise; /** Warm container pool for instant provision */ containerPool?: { claim(): { instance: ContainerInstance; createdAt: number } | null; @@ -1674,6 +1675,13 @@ export class SidecarManager { });
const snapshotRestoreStartedAt = Date.now(); -
const restoreRequestOptions = this.deps.getSnapshotResticPassword -
? { -
resticPassword: await this.deps.getSnapshotResticPassword( -
options.fromSnapshot.sourceProjectRef, -
), -
} -
: undefined; const restored = await storageClient.restoreSnapshot( options.fromSnapshot.sourceProjectRef, storageMountPoint,
@@ -1681,6 +1689,7 @@ export class SidecarManager { snapshotId: options.fromSnapshot.snapshotId, quotaBytes: options.fromSnapshot.quotaBytes, },
-
restoreRequestOptions, ); if (!restored) {
diff --git a/apps/orchestrator/src/routes/projects.ts b/apps/orchestrator/src/routes/projects.ts index 02f78c9..8f7dde0 100644 --- a/apps/orchestrator/src/routes/projects.ts +++ b/apps/orchestrator/src/routes/projects.ts @@ -1529,7 +1529,7 @@ type SnapshotCreateOutcome = } | { ok: false;
-
status: 400 | 404 | 500;
-
status: 400 | 404 | 500 | 501; error: string; code?: string; detail?: string;
@@ -1537,6 +1537,7 @@ type SnapshotCreateOutcome =
const PERSISTENT_WORKSPACE_UNAVAILABLE_ERROR = "persistent workspace unavailable for snapshot"; +const SNAPSHOT_SERVICE_UNAVAILABLE_ERROR = "snapshot service not available";
const createProjectSnapshotRoute = createRoute({ method: "post", @@ -1589,6 +1590,12 @@ const createProjectSnapshotRoute = createRoute({ "application/json": { schema: snapshotErrorSchema }, }, },
- 501: {
-
description: "Snapshot backend unavailable for this driver/runtime", -
content: { -
"application/json": { schema: snapshotErrorSchema }, -
}, - }, }, });
@@ -1642,6 +1649,12 @@ const listProjectSnapshotsRoute = createRoute({ }, }, },
- 501: {
-
description: "Snapshot backend unavailable for this driver/runtime", -
content: { -
"application/json": { schema: snapshotErrorSchema }, -
}, - }, }, });
@@ -1726,6 +1739,12 @@ const restoreProjectSnapshotRoute = createRoute({ }, }, },
- 501: {
-
description: "Snapshot backend unavailable for this driver/runtime", -
content: { -
"application/json": { schema: snapshotErrorSchema }, -
}, - }, }, });
@@ -1811,6 +1830,12 @@ const deleteProjectSnapshotRoute = createRoute({ }, }, },
- 501: {
-
description: "Snapshot backend unavailable for this driver/runtime", -
content: { -
"application/json": { schema: snapshotErrorSchema }, -
}, - }, }, });
@@ -1943,6 +1968,13 @@ function isPersistentWorkspaceUnavailableError(error: unknown): boolean { ); }
+function isSnapshotServiceUnavailableError(error: unknown): boolean {
+  return (
+    error instanceof Error &&
+    error.message === SNAPSHOT_SERVICE_UNAVAILABLE_ERROR
+  );
+}
/** Shared handler for both createProjectSnapshotRoute and its alias. */ async function handleCreateSnapshot( orchestrator: Orchestrator, @@ -1991,6 +2023,14 @@ async function handleCreateSnapshot( code: "PERSISTENT_WORKSPACE_UNAVAILABLE", }; }
- if (isSnapshotServiceUnavailableError(error)) {
-
return { -
ok: false, -
status: 501, -
error: SNAPSHOT_SERVICE_UNAVAILABLE_ERROR, -
code: "SNAPSHOT_SERVICE_UNAVAILABLE", -
}; - } const payload = snapshotErrorResponse("create", error); return { ok: false, @@ -2018,7 +2058,7 @@ async function resolveProjectSnapshotContext(
const snapshotService = orchestrator.getSnapshotService(); if (!snapshotService) {
- throw new Error("Snapshot service not available");
- throw new Error(SNAPSHOT_SERVICE_UNAVAILABLE_ERROR); }
const containerId = project.resources.container?.id; @@ -2333,6 +2373,12 @@ app HTTP_STATUS.NOT_FOUND, ); }
- if (outcome.status === 501) {
-
return c.json( -
{ error: outcome.error, code: outcome.code, detail: outcome.detail }, -
HTTP_STATUS.NOT_IMPLEMENTED, -
); - } return c.json( { error: outcome.error, code: outcome.code, detail: outcome.detail }, HTTP_STATUS.INTERNAL_SERVER_ERROR, @@ -2377,6 +2423,12 @@ app HTTP_STATUS.NOT_FOUND, ); }
- if (outcome.status === 501) {
-
return c.json( -
{ error: outcome.error, code: outcome.code, detail: outcome.detail }, -
HTTP_STATUS.NOT_IMPLEMENTED, -
); - } return c.json( { error: outcome.error, code: outcome.code, detail: outcome.detail }, HTTP_STATUS.INTERNAL_SERVER_ERROR, @@ -2422,6 +2474,15 @@ app HTTP_STATUS.OK, ); } catch (error) {
-
if (isSnapshotServiceUnavailableError(error)) { -
return c.json( -
{ -
error: SNAPSHOT_SERVICE_UNAVAILABLE_ERROR, -
code: "SNAPSHOT_SERVICE_UNAVAILABLE", -
}, -
HTTP_STATUS.NOT_IMPLEMENTED, -
); -
} return c.json( snapshotErrorResponse("list", error), HTTP_STATUS.INTERNAL_SERVER_ERROR,
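Note on the hunks above: each handler repeats the same "sentinel error message maps to 501" branch. A minimal sketch of that mapping follows. The diff does not show the value of `SNAPSHOT_SERVICE_UNAVAILABLE_ERROR`, so the string below is an assumption, and `toSnapshotFailure` is a hypothetical helper that does not exist in the change.

```typescript
// Assumed value: the diff references this constant but never shows its definition.
const SNAPSHOT_SERVICE_UNAVAILABLE_ERROR = "Snapshot service not available";

type SnapshotFailure = { status: 501 | 500; error: string; code?: string };

// Same predicate as in the diff: an exact match on the error message.
function isSnapshotServiceUnavailableError(error: unknown): boolean {
  return (
    error instanceof Error &&
    error.message === SNAPSHOT_SERVICE_UNAVAILABLE_ERROR
  );
}

// Hypothetical helper (not in the diff) that collapses the repeated catch branch.
function toSnapshotFailure(error: unknown): SnapshotFailure {
  if (isSnapshotServiceUnavailableError(error)) {
    return {
      status: 501,
      error: SNAPSHOT_SERVICE_UNAVAILABLE_ERROR,
      code: "SNAPSHOT_SERVICE_UNAVAILABLE",
    };
  }
  // Everything else stays a generic 500.
  return {
    status: 500,
    error: error instanceof Error ? error.message : String(error),
  };
}
```

Because the predicate compares `error.message` for exact equality, an error that is rewrapped with a different message would fall through to the 500 path in this sketch.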
@@ -2521,9 +2582,14 @@ app
       const mutex = orchestrator.getProjectMutex(projectRef);
       const result = await mutex.runExclusive(async () => {
         const shouldResume = context.project.status !== "suspended";
+        const containerId = context.project.resources.container?.id;
+        if (!containerId) {
+          throw new Error("Project has no container to restore");
+        }
         if (shouldResume) {
-          await orchestrator.suspendProjectForUser(projectRef, requesterUserId);
+          await orchestrator.suspendSidecar(containerId);
         }
         try {
@@ -2540,11 +2606,11 @@ app
           if (!restored) {
             if (shouldResume) {
-              await orchestrator.resumeProjectForUser(
-                projectRef,
-                requesterUserId,
+              await orchestrator.resumeSidecar(
+                containerId,
                 requestStartTime,
                 startupId,
+                "project_resume",
               );
             }
@@ -2552,11 +2618,11 @@ app
           }
           if (shouldResume) {
-            await orchestrator.resumeProjectForUser(
-              projectRef,
-              requesterUserId,
+            await orchestrator.resumeSidecar(
+              containerId,
               requestStartTime,
               startupId,
+              "project_resume",
             );
           }
@@ -2568,11 +2634,11 @@ app
         } catch (restoreError) {
           if (shouldResume) {
             try {
-              await orchestrator.resumeProjectForUser(
-                projectRef,
-                requesterUserId,
+              await orchestrator.resumeSidecar(
+                containerId,
                 requestStartTime,
                 startupId,
+                "project_resume",
               );
             } catch (resumeError) {
               logger.error("Failed to resume project after restore failure", {
@@ -2619,6 +2685,15 @@ app
           HTTP_STATUS.BAD_REQUEST,
         );
       }
+      if (isSnapshotServiceUnavailableError(error)) {
+        return c.json(
+          {
+            error: SNAPSHOT_SERVICE_UNAVAILABLE_ERROR,
+            code: "SNAPSHOT_SERVICE_UNAVAILABLE",
+          },
+          HTTP_STATUS.NOT_IMPLEMENTED,
+        );
+      }
       auditSnapshotEvent("snapshot.restore", projectRef, "failure", {
         snapshotId,
         requesterUserId,
@@ -2719,6 +2794,15 @@ app
       return c.json({ success: true, deleted: !!deleted }, HTTP_STATUS.OK);
     } catch (error) {
+      if (isSnapshotServiceUnavailableError(error)) {
+        return c.json(
+          {
+            error: SNAPSHOT_SERVICE_UNAVAILABLE_ERROR,
+            code: "SNAPSHOT_SERVICE_UNAVAILABLE",
+          },
+          HTTP_STATUS.NOT_IMPLEMENTED,
+        );
+      }
       auditSnapshotEvent("snapshot.delete", projectRef, "failure", {
         snapshotId,
         requesterUserId,
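Note on the restore hunks above: the handler now suspends and resumes the sidecar by container ID, and resumes on a best-effort basis when the restore throws. A simplified sketch of that control flow follows. `SidecarControl`, `restoreWithSuspend`, and the `log` parameter are hypothetical names; the real `resumeSidecar` call in the diff takes additional timing and request-ID arguments that are omitted here.

```typescript
// Hypothetical minimal interface; the Orchestrator in the diff has a wider surface.
interface SidecarControl {
  suspendSidecar(containerId: string): Promise<void>;
  resumeSidecar(containerId: string, reason: string): Promise<void>;
}

async function restoreWithSuspend(
  ctl: SidecarControl,
  containerId: string | undefined,
  shouldResume: boolean,
  restore: () => Promise<boolean>,
  log: (message: string, error: unknown) => void = () => {},
): Promise<boolean> {
  // Fail before touching the sidecar if there is nothing to restore into.
  if (!containerId) {
    throw new Error("Project has no container to restore");
  }
  if (shouldResume) {
    await ctl.suspendSidecar(containerId);
  }

  let restored: boolean;
  try {
    restored = await restore();
  } catch (restoreError) {
    if (shouldResume) {
      try {
        await ctl.resumeSidecar(containerId, "project_resume");
      } catch (resumeError) {
        // Best effort: report the resume failure but keep the restore error.
        log("Failed to resume project after restore failure", resumeError);
      }
    }
    throw restoreError;
  }

  // Resume whether or not the restore found anything to apply.
  if (shouldResume) {
    await ctl.resumeSidecar(containerId, "project_resume");
  }
  return restored;
}
```

The sketch keeps the success-path resume outside the `try` so that a resume failure is not mistaken for a restore failure; the diff itself keeps both inside one `try` block.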
diff --git a/apps/orchestrator/src/services/snapshot-job-queue.ts b/apps/orchestrator/src/services/snapshot-job-queue.ts
index 791c240..04bb538 100644
--- a/apps/orchestrator/src/services/snapshot-job-queue.ts
+++ b/apps/orchestrator/src/services/snapshot-job-queue.ts
@@ -256,13 +256,17 @@ export class SnapshotJobQueue {
     if (!project) {
       throw new Error(`Project ${projectRef} not found`);
     }
+    const containerId = project.resources.container?.id;
+    if (!containerId) {
+      throw new Error(`Project ${projectRef} has no container to restore`);
+    }

     // Suspend before restore — only if the project was running when the
     // restore was requested. shouldResume was captured at enqueue time;
     // re-deriving here would race with the route-level suspend.
     if (shouldResume) {
       await job.updateProgress(10);
-      await orchestrator.suspendProjectForUser(projectRef, requesterUserId);
+      await orchestrator.suspendSidecar(containerId);
     }

     // Guard against shouldResume drift (harden finding R7): if the
@@ -294,11 +298,11 @@ export class SnapshotJobQueue {
       if (!restored) {
         // Resume even if restore found nothing
         if (shouldResume) {
-          await orchestrator.resumeProjectForUser(
-            projectRef,
-            requesterUserId,
+          await orchestrator.resumeSidecar(
+            containerId,
             start,
             requestId,
+            "project_resume",
           );
         }
         const durationMs = Date.now() - start;
@@ -313,11 +317,11 @@ export class SnapshotJobQueue {

       // Resume after successful restore
       if (shouldResume) {
-        await orchestrator.resumeProjectForUser(
-          projectRef,
-          requesterUserId,
+        await orchestrator.resumeSidecar(
+          containerId,
           start,
           requestId,
+          "project_resume",
         );
       }

@@ -355,11 +359,11 @@ export class SnapshotJobQueue {
       // Best-effort resume on failure
       if (shouldResume) {
         try {
-          await orchestrator.resumeProjectForUser(
-            projectRef,
-            requesterUserId,
+          await orchestrator.resumeSidecar(
+            containerId,
             start,
             requestId,
+            "project_resume",
           );
         } catch (resumeError) {
           logger.error("Failed to resume project after restore failure", {
diff --git a/apps/orchestrator/src/services/snapshot-service.ts b/apps/orchestrator/src/services/snapshot-service.ts
index b850cfd..31675f0 100644
--- a/apps/orchestrator/src/services/snapshot-service.ts
+++ b/apps/orchestrator/src/services/snapshot-service.ts
@@ -198,6 +198,12 @@ export class SnapshotService {
     return key;
   }

+  async resolveRequestScopedResticPassword(
+    projectRef: string,
+  ): Promise {
+    return this.resolveResticPassword(projectRef);
+  }
+
   private getCustomerManagerKey(
     projectRef: string,
     cfg: CustomerStorageConfig,
diff --git a/apps/orchestrator/tests/unit/constants.test.ts b/apps/orchestrator/tests/unit/constants.test.ts
new file mode 100644
index 0000000..48a4f70
--- /dev/null
+++ b/apps/orchestrator/tests/unit/constants.test.ts
@@ -0,0 +1,73 @@
+import { afterEach, describe, expect, it, vi } from "vitest";
+
+const originalNodeEnv = process.env.NODE_ENV;
+const originalCi = process.env.CI;
+const originalRelaxed = process.env.ORCHESTRATOR_RELAXED_RATE_LIMITS;
+
+afterEach(() => {
+  vi.resetModules();
+  if (originalNodeEnv === undefined) {
+    delete process.env.NODE_ENV;
+  } else {
+    process.env.NODE_ENV = originalNodeEnv;
+  }
+  if (originalCi === undefined) {
+    delete process.env.CI;
+  } else {
+    process.env.CI = originalCi;
+  }
+  if (originalRelaxed === undefined) {
+    delete process.env.ORCHESTRATOR_RELAXED_RATE_LIMITS;
+  } else {
+    process.env.ORCHESTRATOR_RELAXED_RATE_LIMITS = originalRelaxed;
+  }
+});
+
+describe("orchestrator rate limits", () => {
+  it("keeps staging-like environments on production-safe limits", async () => {
+    process.env.NODE_ENV = "staging";
+    process.env.CI = "false";
+
+    const { RATE_LIMITS } = await import("../../src/constants");
+
+    expect(RATE_LIMITS.MANAGER_MAX_REQUESTS).toBe(100);
+    expect(RATE_LIMITS.AGENTS_MAX_REQUESTS).toBe(60);
+    expect(RATE_LIMITS.MEMORIES_MAX_REQUESTS).toBe(60);
+  });
+
+  it("only relaxes limits in explicit local development", async () => {
+    process.env.NODE_ENV = "development";
+    process.env.CI = "false";
+    process.env.ORCHESTRATOR_RELAXED_RATE_LIMITS = "true";
+
+    const { RATE_LIMITS } = await import("../../src/constants");
+
+    expect(RATE_LIMITS.MANAGER_MAX_REQUESTS).toBe(10_000);
+    expect(RATE_LIMITS.AGENTS_MAX_REQUESTS).toBe(10_000);
+    expect(RATE_LIMITS.MEMORIES_MAX_REQUESTS).toBe(10_000);
+  });
+
+  it("does not relax limits for development when CI is true", async () => {
+    process.env.NODE_ENV = "development";
+    process.env.CI = "true";
+    process.env.ORCHESTRATOR_RELAXED_RATE_LIMITS = "true";
+
+    const { RATE_LIMITS } = await import("../../src/constants");
+
+    expect(RATE_LIMITS.MANAGER_MAX_REQUESTS).toBe(100);
+    expect(RATE_LIMITS.AGENTS_MAX_REQUESTS).toBe(60);
+    expect(RATE_LIMITS.MEMORIES_MAX_REQUESTS).toBe(60);
+  });
+
+  it("does not relax limits without the explicit opt-in flag", async () => {
+    process.env.NODE_ENV = "development";
+    process.env.CI = "false";
+    delete process.env.ORCHESTRATOR_RELAXED_RATE_LIMITS;
+
+    const { RATE_LIMITS } = await import("../../src/constants");
+
+    expect(RATE_LIMITS.MANAGER_MAX_REQUESTS).toBe(100);
+    expect(RATE_LIMITS.AGENTS_MAX_REQUESTS).toBe(60);
+    expect(RATE_LIMITS.MEMORIES_MAX_REQUESTS).toBe(60);
+  });
+});
diff --git a/apps/orchestrator/tests/unit/docker-driver-envs.test.ts b/apps/orchestrator/tests/unit/docker-driver-envs.test.ts
index 8d3ed77..0b77440 100644
--- a/apps/orchestrator/tests/unit/docker-driver-envs.test.ts
+++ b/apps/orchestrator/tests/unit/docker-driver-envs.test.ts
@@ -144,6 +144,7 @@ describe("DockerDriver - Runtime Cache Envs", () => {
     expect(envMap.PIP_CACHE_DIR).toBe("/home/agent/project-1/.cache/pip");
     expect(envMap.GOMODCACHE).toBe("/home/agent/project-1/.cache/go");
     expect(envMap.AGENT_WORKSPACE_ROOT).toBe("/home/agent/project-1");
+    expect(envMap.WORKSPACE_PATH).toBe("/home/agent/project-1");
     expect(envMap.STORAGE_PATH).toBe("/home/agent/project-1/.sidecar/state");

     // Verify XDG directories
@@ -377,6 +378,119 @@ describe("DockerDriver - Runtime Cache Envs", () => {
     expect(envMap.AGENT_SUBPROCESS_GID).toBe("1000");
   });
+
+  it("scrubs user-supplied SIDECAR_* control flags from container env", async () => {
+    await driver.createContainer({
+      image: "test-image",
+      sessionId: "session-sidecar-flags",
+      resources: { cpu: 1, memory: 1024, disk: 1024, pids: 100 },
+      labels: { "agent.project-ref": "project-sidecar-flags" },
+      env: {
+        SIDECAR_RELAXED_RATE_LIMITS: "true",
+        SIDECAR_AUTH_DISABLED: "true",
+        SIDECAR_DEBUG_ENABLED: "true",
+        AGENT_WORKSPACE_ROOT: "/tmp/attacker-owned",
+        CONTAINER_ID: "attacker-controlled",
+        STORAGE_PATH: "/tmp/attacker-state",
+      },
+      volumes: [],
+      network: "default",
+      security: {
+        readOnly: false,
+        noNewPrivileges: false,
+        user: "root",
+        capabilities: { drop: [], add: [] },
+      },
+    });
+
+    const createCall = mockDocker.createContainer.mock.calls.at(-1)?.[0];
+    const envs = createCall.Env;
+    const envMap: Record<string, string> = {};
+    envs.forEach((e: string) => {
+      const [k, v] = e.split("=");
+      envMap[k] = v;
+    });
+
+    expect(envMap.SIDECAR_RELAXED_RATE_LIMITS).toBeUndefined();
+    expect(envMap.SIDECAR_DEBUG_ENABLED).toBeUndefined();
+    expect(envMap.SIDECAR_AUTH_DISABLED).toBe(
+      process.env.SIDECAR_AUTH_DISABLED,
+    );
+    expect(envMap.CONTAINER_ID).toBeUndefined();
+    expect(envMap.AGENT_WORKSPACE_ROOT).toBe("/home/agent/project-1");
+    expect(envMap.WORKSPACE_PATH).toBe("/home/agent/project-1");
+    expect(envMap.STORAGE_PATH).toBe("/home/agent/project-1/.sidecar/state");
+  });
+
+  it("routes egress-enabled sandboxes through iron-proxy with explicit CA trust", async () => {
+    const createProxy = vi.fn().mockResolvedValue({
+      containerId: "proxy-container-1",
+      proxyIp: "172.20.0.7",
+      caCertPath: "/tmp/egress/ca.crt",
+      sessionId: "session-egress",
+      networkName: "agent-net-session-egress",
+    });
+
+    (driver as any).egressManager = { createProxy };
+
+    await driver.createContainer({
+      image: "test-image",
+      sessionId: "session-egress",
+      resources: { cpu: 1, memory: 1024, disk: 1024, pids: 100 },
+      labels: { "agent.project-ref": "project-1" },
+      env: {
+        OPENCODE_MODEL_API_KEY: "sk-real-egress",
+      },
+      volumes: [],
+      network: "default",
+      security: {
+        readOnly: false,
+        noNewPrivileges: false,
+        user: "root",
+        capabilities: { drop: [], add: [] },
+      },
+    });
+
+    expect(createProxy).toHaveBeenCalledWith(
+      "session-egress",
+      "agent-net-session-egress",
+      expect.objectContaining({ enabled: true }),
+      { OPENCODE_MODEL_API_KEY: "sk-real-egress" },
+    );
+
+    const createCall = mockDocker.createContainer.mock.calls.at(-1)?.[0];
+    const envMap: Record<string, string> = {};
+    for (const entry of createCall.Env as string[]) {
+      const [key, ...valueParts] = entry.split("=");
+      envMap[key] = valueParts.join("=");
+    }
+
+    expect(envMap.OPENCODE_MODEL_API_KEY).toBe("proxy-model-token");
+    expect(envMap.EGRESS_PROXY_ENABLED).toBe("true");
+    expect(envMap.EGRESS_PROXY_IP).toBe("172.20.0.7");
+    expect(envMap.HTTP_PROXY).toBe("http://172.20.0.7:80");
+    expect(envMap.HTTPS_PROXY).toBe("http://172.20.0.7:1080");
+    expect(envMap.NO_PROXY).toBe("localhost,127.0.0.1");
+    expect(envMap.SSL_CERT_FILE).toBe(
+      "/usr/local/share/ca-certificates/egress-proxy.crt",
+    );
+    expect(envMap.GIT_SSL_CAINFO).toBe(
+      "/usr/local/share/ca-certificates/egress-proxy.crt",
+    );
+    expect(envMap.CURL_CA_BUNDLE).toBe(
+      "/usr/local/share/ca-certificates/egress-proxy.crt",
+    );
+    expect(envMap.NODE_EXTRA_CA_CERTS).toBe(
+      "/usr/local/share/ca-certificates/egress-proxy.crt",
+    );
+    expect(createCall.HostConfig.Dns).toEqual(["172.20.0.7"]);
+    expect(createCall.HostConfig.ReadonlyRootfs).toBe(true);
+    expect(createCall.HostConfig.Binds).toEqual(
+      expect.arrayContaining([
+        "/tmp/egress/ca.crt:/usr/local/share/ca-certificates/egress-proxy.crt:ro",
+      ]),
+    );
+  });
+
   it("sets OPENCODE_MAX_HEAP_MB based on container memory", async () => {
     await driver.createContainer({
       image: "test-image",
diff --git a/apps/orchestrator/tests/unit/docker-driver-startsidecar.test.ts b/apps/orchestrator/tests/unit/docker-driver-startsidecar.test.ts
index d477761..201d66b 100644
--- a/apps/orchestrator/tests/unit/docker-driver-startsidecar.test.ts
+++ b/apps/orchestrator/tests/unit/docker-driver-startsidecar.test.ts
@@ -24,6 +24,19 @@ vi.mock("@repo/shared", async (importOriginal) => {
   };
 });

+function createMockExecStream() {
+  return {
+    destroy: vi.fn(),
+    resume: vi.fn(),
+    removeListener: vi.fn(),
+    on: vi.fn((event: string, handler: () => void) => {
+      if (event === "end") {
+        setImmediate(handler);
+      }
+    }),
+  };
+}
+
 describe("DockerDriver.startContainer sidecar bootstrap path", () => {
   let driver: DockerDriver;
   let mockDocker: { getContainer: ReturnType };
@@ -39,24 +52,28 @@ describe("DockerDriver.startContainer sidecar bootstrap path", () => {
       start: vi.fn().mockResolvedValue(undefined),
       inspect: vi.fn().mockResolvedValue({
         Config: {
-          Env: ["NPM_CONFIG_CACHE=/home/agent/.cache/npm"],
+          Env: [
+            "NPM_CONFIG_CACHE=/home/agent/.cache/npm",
+            "AGENT_WORKSPACE_ROOT=/home/agent",
+          ],
+          Labels: {
+            "agent.session-id": "session-1",
+          },
         },
       }),
       exec: vi.fn().mockResolvedValue({
-        start: vi.fn().mockResolvedValue({
-          resume: vi.fn(),
-          on: vi.fn((event: string, handler: () => void) => {
-            if (event === "end") {
-              setImmediate(handler);
-            }
-          }),
-        }),
+        start: vi.fn().mockResolvedValue(createMockExecStream()),
         inspect: vi.fn().mockResolvedValue({ ExitCode: 0 }),
       }),
     };

     mockDocker = {
       getContainer: vi.fn().mockReturnValue(mockContainer),
+      modem: {
+        demuxStream: vi.fn((_stream: any, stdout: NodeJS.WritableStream) => {
+          stdout.end();
+        }),
+      },
     };

     vi.mocked(Docker as unknown as ReturnType).mockImplementation(
@@ -76,6 +93,7 @@ describe("DockerDriver.startContainer sidecar bootstrap path", () => {
   });

   afterEach(() => {
+    vi.useRealTimers();
     vi.restoreAllMocks();
     process.env = { ...originalEnv };
   });
@@ -139,19 +157,19 @@ describe("DockerDriver.startContainer sidecar bootstrap path", () => {
   it("fails loudly when cargo config write exits non-zero", async () => {
     process.env.CONTAINER_CARGO_REGISTRY =
       "http://host.docker.internal:8000/api/v1/cratesio/";
-    // Cargo config still uses container.exec as "agent" user.
-    // Cache repair is now via repairContainerCacheOwnership (mocked above).
-    mockContainer.exec.mockResolvedValue({
-      start: vi.fn().mockResolvedValue({
-        resume: vi.fn(),
-        on: vi.fn((event: string, handler: () => void) => {
-          if (event === "end") {
-            setImmediate(handler);
-          }
-        }),
-      }),
+    const successfulExec = {
+      start: vi.fn().mockResolvedValue(createMockExecStream()),
+      inspect: vi.fn().mockResolvedValue({ ExitCode: 0 }),
+    };
+    const failingExec = {
+      start: vi.fn().mockResolvedValue(createMockExecStream()),
       inspect: vi.fn().mockResolvedValue({ ExitCode: 23 }),
-    });
+    };
+    mockContainer.exec
+      .mockResolvedValueOnce(successfulExec)
+      .mockResolvedValueOnce(successfulExec)
+      .mockResolvedValueOnce(successfulExec)
+      .mockResolvedValueOnce(failingExec);

     await expect(driver.startContainer("container-id")).rejects.toThrow(
       "Cargo config write exited with code 23",
@@ -167,6 +185,60 @@ describe("DockerDriver.startContainer sidecar bootstrap path", () => {
       "Cache ownership repair exited with code 19",
     );
   });
+
+  it("fails loudly when workspace ownership repair never reports an exit code", async () => {
+    process.env.ORCHESTRATOR_TEST_EXEC_EXIT_TIMEOUT_MS = "50";
+    const successfulExec = {
+      start: vi.fn().mockResolvedValue(createMockExecStream()),
+      inspect: vi.fn().mockResolvedValue({ ExitCode: 0 }),
+    };
+    const hangingChownExec = {
+      start: vi.fn().mockResolvedValue(createMockExecStream()),
+      inspect: vi.fn().mockResolvedValue({ ExitCode: undefined }),
+    };
+    mockContainer.exec
+      .mockResolvedValueOnce(successfulExec)
+      .mockResolvedValueOnce(successfulExec)
+      .mockResolvedValueOnce(hangingChownExec);
+
+    const startPromise = driver.startContainer("container-id");
+
+    await expect(startPromise).rejects.toThrow(
+      "Workspace ownership repair for /home/agent/session-1 did not exit within 50ms",
+    );
+  });
+
+  it("scopes workspace ownership repair to the session workspace when the env root is /home/agent", async () => {
+    await driver.startContainer("container-id");
+    expect(mockContainer.exec).toHaveBeenCalledWith(
+      expect.objectContaining({
+        Cmd: expect.arrayContaining([
+          "/home/agent/session-1",
+          "/home/agent/session-1/.sidecar",
+        ]),
+      }),
+    );
+    expect(mockContainer.exec).toHaveBeenCalledWith(
+      expect.objectContaining({
+        Cmd: expect.arrayContaining([
+          "chown",
+          "agent",
+          "/home/agent/session-1",
+        ]),
+      }),
+    );
+    const recursiveRepairCall = mockContainer.exec.mock.calls.find(
+      ([options]) =>
+        Array.isArray(options.Cmd) &&
+        options.Cmd[0] === "sh" &&
+        typeof options.Cmd[2] === "string" &&
+        options.Cmd[2].includes("/home/agent/session-1"),
+    );
+    expect(recursiveRepairCall).toBeTruthy();
+    expect(recursiveRepairCall?.[0].Cmd[2]).toContain(
+      "workspace='/home/agent/session-1'",
+    );
+  });
 });

 describe("DockerDriver.startContainer startup script execution", () => {
@@ -181,19 +253,23 @@ describe("DockerDriver.startContainer startup script execution", () => {

   function createMockExec(exitCode = 0) {
     return {
-      start: vi.fn().mockResolvedValue({
-        destroy: vi.fn(),
-        resume: vi.fn(),
-        on: vi.fn((event: string, handler: () => void) => {
-          if (event === "end") {
-            setImmediate(handler);
-          }
-        }),
-      }),
+      start: vi.fn().mockResolvedValue(createMockExecStream()),
       inspect: vi.fn().mockResolvedValue({ ExitCode: exitCode }),
     };
   }

+  function isWorkspaceBootstrapExec(options: { Cmd?: string[] }): boolean {
+    const cmd = options.Cmd ?? [];
+    return (
+      cmd[0] === "mkdir" ||
+      cmd[0] === "chown" ||
+      (cmd[0] === "sh" &&
+        cmd[1] === "-lc" &&
+        typeof cmd[2] === "string" &&
+        cmd[2].includes('find "$workspace"'))
+    );
+  }
+
   function makeScript(overrides: Partial = {}): StartupScript {
     return {
       name: "test-script",
@@ -219,6 +295,11 @@ describe("DockerDriver.startContainer startup script execution", () => {

     mockDocker = {
       getContainer: vi.fn().mockReturnValue(mockContainer),
+      modem: {
+        demuxStream: vi.fn((_stream: any, stdout: NodeJS.WritableStream) => {
+          stdout.end();
+        }),
+      },
     };

     vi.mocked(Docker as unknown as ReturnType).mockImplementation(
@@ -303,7 +384,11 @@ describe("DockerDriver.startContainer startup script execution", () => {
       name: "failing-script",
       continueOnFailure: false,
     });
-    mockContainer.exec.mockResolvedValue(createMockExec(1));
+    mockContainer.exec.mockImplementation((options) =>
+      Promise.resolve(
+        createMockExec(isWorkspaceBootstrapExec(options) ? 0 : 1),
+      ),
+    );

     await expect(
       driver.startContainer("ctr-1", { startupScripts: [script] }),
@@ -312,7 +397,11 @@ describe("DockerDriver.startContainer startup script execution", () => {

   it("continues on non-zero exit when continueOnFailure is true", async () => {
     const script = makeScript({ name: "soft-fail", continueOnFailure: true });
-    mockContainer.exec.mockResolvedValue(createMockExec(42));
+    mockContainer.exec.mockImplementation((options) =>
+      Promise.resolve(
+        createMockExec(isWorkspaceBootstrapExec(options) ? 0 : 42),
+      ),
+    );

     await expect(
       driver.startContainer("ctr-1", { startupScripts: [script] }),
@@ -366,6 +455,7 @@ describe("DockerDriver.startContainer startup script execution", () => {
       start: vi.fn().mockResolvedValue({
         destroy: destroyFn,
         resume: vi.fn(),
+        removeListener: vi.fn(),
         on: vi.fn(), // never fires "end" — simulates a hanging script
       }),
       inspect: vi.fn().mockResolvedValue({ ExitCode: 0 }),
@@ -383,9 +473,10 @@ describe("DockerDriver.startContainer startup script execution", () => {
     const script1 = makeScript({ name: "first" });
     const script2 = makeScript({ name: "second" });

-    mockContainer.exec.mockImplementation(() => {
-      // Track call order via the mock
-      calls.push("exec");
+    mockContainer.exec.mockImplementation((options) => {
+      if (!isWorkspaceBootstrapExec(options)) {
+        calls.push("exec");
+      }
       return Promise.resolve(createMockExec(0));
     });
@@ -393,8 +484,8 @@ describe("DockerDriver.startContainer startup script execution", () => {
       startupScripts: [script1, script2],
     });

-    // Two exec calls — one per script
-    expect(mockContainer.exec).toHaveBeenCalledTimes(2);
+    // Two bootstrap execs plus one exec per script.
+    expect(mockContainer.exec).toHaveBeenCalledTimes(4);
     expect(calls).toHaveLength(2);
   });

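Note on the test changes above: the updated assertions depend on separating workspace-bootstrap execs from startup-script execs. A standalone sketch of that classification follows. The helper body mirrors `isWorkspaceBootstrapExec` from the diff; the `recorded` commands are illustrative assumptions, not commands taken from the driver.

```typescript
// Mirrors the helper added in the test diff.
function isWorkspaceBootstrapExec(options: { Cmd?: string[] }): boolean {
  const cmd = options.Cmd ?? [];
  return (
    cmd[0] === "mkdir" ||
    cmd[0] === "chown" ||
    (cmd[0] === "sh" &&
      cmd[1] === "-lc" &&
      typeof cmd[2] === "string" &&
      cmd[2].includes('find "$workspace"'))
  );
}

// Illustrative exec options: two bootstrap commands plus two script commands.
const recorded: { Cmd?: string[] }[] = [
  { Cmd: ["mkdir", "-p", "/home/agent/session-1"] },
  { Cmd: ["chown", "agent", "/home/agent/session-1"] },
  { Cmd: ["sh", "-lc", "echo first"] },
  { Cmd: ["sh", "-lc", "echo second"] },
];

// Only the non-bootstrap execs count as startup-script runs.
const scriptExecs = recorded.filter((o) => !isWorkspaceBootstrapExec(o));
```

With these inputs the total exec count is four while `scriptExecs` holds two entries, which matches the shape of the `toHaveBeenCalledTimes(4)` and `toHaveLength(2)` assertions in the diff.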
diff --git a/apps/orchestrator/tests/unit/orchestrator/project-manager.test.ts b/apps/orchestrator/tests/unit/orchestrator/project-manager.test.ts
index ebb40c8..b906dc9 100644
--- a/apps/orchestrator/tests/unit/orchestrator/project-manager.test.ts
+++ b/apps/orchestrator/tests/unit/orchestrator/project-manager.test.ts
@@ -954,15 +954,6 @@ describe("ProjectManager", () => {
         findSnapshotProject: vi.fn().mockResolvedValue({
           projectRef: "source-project",
         }),
-        restoreFromSnapshot: vi.fn().mockResolvedValue({
-          id: "snapshot-123",
-          snapshotId: "snapshot-123",
-          projectRef: "source-project",
-          createdAt: new Date("2026-04-14T00:00:00.000Z"),
-          tags: [],
-          paths: ["/mnt/storage/restore-target"],
-          sizeBytes: 0,
-        }),
         restoreOnResume: vi.fn().mockResolvedValue(false),
         trackProject: vi.fn(),
         untrackProject: vi.fn(),
@@ -1041,24 +1032,19 @@ describe("ProjectManager", () => {
         ["source-project"],
         "snapshot-123",
       );
-      expect(sidecarManager.suspendSidecar).toHaveBeenCalledWith(
-        "container-restore",
-      );
-      expect(snapshotService.restoreFromSnapshot).toHaveBeenCalledWith(
-        "source-project",
-        "snapshot-123",
-        "host-1",
-        "http://host-1:3001",
-        "/mnt/storage/restore-target",
-        undefined,
-        undefined,
-      );
-      expect(sidecarManager.resumeSidecar).toHaveBeenCalledWith(
-        "container-restore",
-        expect.any(Number),
-        undefined,
-        "project_provision",
+      expect(sidecarManager.createSidecar).toHaveBeenCalledWith(
+        expect.objectContaining({
+          sessionId: "restore-target",
+          fromSnapshot: {
+            sourceProjectRef: "source-project",
+            snapshotId: "snapshot-123",
+            quotaBytes: undefined,
+          },
+        }),
       );
+      expect(sidecarManager.suspendSidecar).not.toHaveBeenCalled();
+      expect(sidecarManager.resumeSidecar).not.toHaveBeenCalled();
+      expect(snapshotService.restoreOnResume).not.toHaveBeenCalled();
     });

   it("fails before provisioning when the source snapshot cannot be resolved", async () => {
diff --git a/apps/sidecar/docker/Dockerfile b/apps/sidecar/docker/Dockerfile
index 834cd49..5ce8da7 100644
--- a/apps/sidecar/docker/Dockerfile
+++ b/apps/sidecar/docker/Dockerfile
@@ -197,12 +197,14 @@ RUN rm -f /usr/local/bin/opencode /usr/bin/opencode 2>/dev/null || true && \
-# minimal in-image fallback for the OAuth CLIs users rely on directly; Nix
-# still wins whenever the profile is healthy because /nix/profile/bin is
+# minimal in-image fallback for the CLIs the sidecar must be able to spawn;
+# Nix still wins whenever the profile is healthy because /nix/profile/bin is
 ARG CLAUDE_CODE_VERSION=2.1.63
 ARG CODEX_VERSION=0.122.0
+ARG OPENCODE_VERSION=1.14.20
 RUN npm install -g --ignore-scripts \
+    "opencode-ai@${OPENCODE_VERSION}" \
     "@anthropic-ai/claude-code@${CLAUDE_CODE_VERSION}" \
     "@openai/codex@${CODEX_VERSION}"
diff --git a/apps/sidecar/src/constants.ts b/apps/sidecar/src/constants.ts
index ce528d5..ba42219 100644
--- a/apps/sidecar/src/constants.ts
+++ b/apps/sidecar/src/constants.ts
@@ -137,15 +137,25 @@ export const FILE_WATCHER_DEFAULTS = {
  * as they're long-lived connections that naturally self-limit
  */
 export const RATE_LIMITS = {
-  AGENTS_WINDOW_MS: 60_000, // 1 minute
-  AGENTS_MAX_REQUESTS: 100, // 100 requests per minute per IP (increased for resilience)
-  // Stricter limits for session creation to prevent enumeration attacks
-  SESSION_CREATE_WINDOW_MS: 60_000, // 1 minute
-  SESSION_CREATE_MAX_REQUESTS: 10, // 10 new sessions per minute per IP
+  AGENTS_WINDOW_MS: 60_000,
+  AGENTS_MAX_REQUESTS: 100,
+  SESSION_CREATE_WINDOW_MS: 60_000,
+  SESSION_CREATE_MAX_REQUESTS: 10,
   FILES_WINDOW_MS: 60_000,
   FILES_MAX_REQUESTS: 300,
   TERMINAL_WINDOW_MS: 60_000,
-  TERMINAL_MAX_REQUESTS: 50, // Terminal create/exec — lower limit, higher impact ops
+  TERMINAL_MAX_REQUESTS: 50,
+} as const;
+
+export const RELAXED_RATE_LIMITS = {
+  AGENTS_WINDOW_MS: 60_000,
+  AGENTS_MAX_REQUESTS: 10_000,
+  SESSION_CREATE_WINDOW_MS: 60_000,
+  SESSION_CREATE_MAX_REQUESTS: 1_000,
+  FILES_WINDOW_MS: 60_000,
+  FILES_MAX_REQUESTS: 10_000,
+  TERMINAL_WINDOW_MS: 60_000,
+  TERMINAL_MAX_REQUESTS: 10_000,
 } as const;

 /**
diff --git a/apps/sidecar/src/index.ts b/apps/sidecar/src/index.ts
index f7c86cf..2662e34 100644
--- a/apps/sidecar/src/index.ts
+++ b/apps/sidecar/src/index.ts
@@ -8,10 +8,15 @@ import { requestId } from "hono/request-id";
 import { secureHeaders } from "hono/secure-headers";
 import { timeout } from "hono/timeout";
 import { rateLimiter } from "hono-rate-limiter";
-import { RATE_LIMITS, TEST_RATE_LIMITS, TIMEOUTS } from "./constants";
+import {
+  RATE_LIMITS,
+  RELAXED_RATE_LIMITS,
+  TEST_RATE_LIMITS,
+  TIMEOUTS,
+} from "./constants";
 import { connectionHealthMonitor } from "./lib/connection-health-monitor";
 import { auditLogger } from "./middleware/audit";
-import { authGuard } from "./middleware/auth";
+import { authGuard, isAuthenticatedRequest } from "./middleware/auth";
 import { securityCors } from "./middleware/cors";
 import { errorHandler } from "./middleware/errors";
 import { headerValidators } from "./middleware/headers";
@@ -53,12 +58,21 @@ import toolsMiseRoutes from "./routes/tools";
 const getClientIp = (c: Context) =>
   c.req.header("x-forwarded-for") || c.req.header("x-real-ip") || "anonymous";
 const isTestEnv = process.env.NODE_ENV === "test" || process.env.CI === "true";
-const fileOpsWindowMs = isTestEnv
-  ? TEST_RATE_LIMITS.FILES_WINDOW_MS
-  : RATE_LIMITS.FILES_WINDOW_MS;
-const fileOpsRequestLimit = isTestEnv
-  ? TEST_RATE_LIMITS.FILES_MAX_REQUESTS
-  : RATE_LIMITS.FILES_MAX_REQUESTS;
+const isLocalDevelopment =
+  process.env.NODE_ENV === "development" && process.env.CI !== "true";
+const useRelaxedRateLimits =
+  isLocalDevelopment && process.env.SIDECAR_RELAXED_RATE_LIMITS === "true";
+const sidecarVersion =
+  process.env.SIDECAR_VERSION ?? process.env.npm_package_version ?? "dev";
+const sidecarImage =
+  process.env.SIDECAR_IMAGE_SHA ?? process.env.SIDECAR_IMAGE_TAG ?? "unknown";
+const effectiveRateLimits = isTestEnv
+  ? TEST_RATE_LIMITS
+  : useRelaxedRateLimits
+    ? RELAXED_RATE_LIMITS
+    : RATE_LIMITS;
+const fileOpsWindowMs = effectiveRateLimits.FILES_WINDOW_MS;
+const fileOpsRequestLimit = effectiveRateLimits.FILES_MAX_REQUESTS;
 const fileOpsRateLimiter = () =>
   rateLimiter({
     windowMs: fileOpsWindowMs,
@@ -86,6 +100,16 @@ const app = new OpenAPIHono<{
   .use("*", metricsMiddleware())
   .use(mtlsGuard())
   .use(authGuard())
+  .use("*", async (c, next) => {
+    try {
+      await next();
+    } finally {
+      if (isAuthenticatedRequest(c)) {
+        c.header("X-Sidecar-Version", sidecarVersion);
+        c.header("X-Sidecar-Image", sidecarImage);
+      }
+    }
+  })
   .use(auditLogger())
   .use(securityCors())
   .use("*", errorHandler())
@@ -93,24 +117,16 @@ const app = new OpenAPIHono<{
   .use(
     "/agents/sessions/*",
     rateLimiter({
-      windowMs: isTestEnv
-        ? TEST_RATE_LIMITS.AGENTS_WINDOW_MS
-        : RATE_LIMITS.AGENTS_WINDOW_MS,
-      limit: isTestEnv
-        ? TEST_RATE_LIMITS.AGENTS_MAX_REQUESTS
-        : RATE_LIMITS.AGENTS_MAX_REQUESTS,
+      windowMs: effectiveRateLimits.AGENTS_WINDOW_MS,
+      limit: effectiveRateLimits.AGENTS_MAX_REQUESTS,
       keyGenerator: getClientIp,
     }),
   )
   .use(
     "/agents/run/*",
     rateLimiter({
-      windowMs: isTestEnv
-        ? TEST_RATE_LIMITS.AGENTS_WINDOW_MS
-        : RATE_LIMITS.AGENTS_WINDOW_MS,
-      limit: isTestEnv
-        ? TEST_RATE_LIMITS.AGENTS_MAX_REQUESTS
-        : RATE_LIMITS.AGENTS_MAX_REQUESTS,
+      windowMs: effectiveRateLimits.AGENTS_WINDOW_MS,
+      limit: effectiveRateLimits.AGENTS_MAX_REQUESTS,
       keyGenerator: getClientIp,
     }),
   )
@@ -118,12 +134,8 @@ const app = new OpenAPIHono<{
   .use(
     "/agents/sessions",
     rateLimiter({
-      windowMs: isTestEnv
-        ? TEST_RATE_LIMITS.SESSION_CREATE_WINDOW_MS
-        : RATE_LIMITS.SESSION_CREATE_WINDOW_MS,
-      limit: isTestEnv
-        ? TEST_RATE_LIMITS.SESSION_CREATE_MAX_REQUESTS
-        : RATE_LIMITS.SESSION_CREATE_MAX_REQUESTS,
+      windowMs: effectiveRateLimits.SESSION_CREATE_WINDOW_MS,
+      limit: effectiveRateLimits.SESSION_CREATE_MAX_REQUESTS,
       keyGenerator: getClientIp,
       // Only apply to POST requests (session creation)
       skip: (c) => c.req.method !== "POST",
@@ -137,12 +149,8 @@ const app = new OpenAPIHono<{
   .use(
     "/terminals/*",
     rateLimiter({
-      windowMs: isTestEnv
-        ? TEST_RATE_LIMITS.TERMINAL_WINDOW_MS
-        : RATE_LIMITS.TERMINAL_WINDOW_MS,
-      limit: isTestEnv
-        ? TEST_RATE_LIMITS.TERMINAL_MAX_REQUESTS
-        : RATE_LIMITS.TERMINAL_MAX_REQUESTS,
+      windowMs: effectiveRateLimits.TERMINAL_WINDOW_MS,
+      limit: effectiveRateLimits.TERMINAL_MAX_REQUESTS,
       keyGenerator: getClientIp,
     }),
   )
diff --git a/apps/sidecar/src/lib/token-blocklist.ts b/apps/sidecar/src/lib/token-blocklist.ts
index 6a6de57..b740808 100644
--- a/apps/sidecar/src/lib/token-blocklist.ts
+++ b/apps/sidecar/src/lib/token-blocklist.ts
@@ -15,6 +15,9 @@ interface BlockedEntry {
 }

 const blocklist = new Map<string, BlockedEntry>();
+const SHORT_DOCKER_ID_LENGTH = 12;
+const FULL_DOCKER_ID_LENGTH = 64;
+const HEX_ID_PATTERN = /^[a-f0-9]+$/i;

 // Cleanup stale entries every 5 minutes
 setInterval(
@@ -37,7 +40,10 @@ export function revokeToken(jti: string): void {
 /** Revoke all tokens for a container (by cid prefix in jti). */
 export function revokeAllForContainer(containerId: string): void {
   if (!containerId) return;
-  blocklist.set(`prefix:${containerId}`, { blockedAt: Date.now() });
+  const blockedAt = Date.now();
+  for (const cid of getContainerRevocationCandidates(containerId)) {
+    blocklist.set(`prefix:${cid}`, { blockedAt });
+  }
 }

 /**
@@ -49,10 +55,36 @@ export function isRevoked(jti: string | undefined, cid?: string): boolean {
   if (blocklist.has(jti)) return true;
   // Extract cid from jti if not provided (jti format: "cid:timestamp:random")
   const effectiveCid = cid || jti.split(":")[0];
-  if (effectiveCid && blocklist.has(`prefix:${effectiveCid}`)) return true;
+  if (effectiveCid) {
+    for (const candidate of getContainerRevocationCandidates(effectiveCid)) {
+      if (blocklist.has(`prefix:${candidate}`)) return true;
+    }
+  }
   return false;
 }

+function getContainerRevocationCandidates(containerId: string): string[] {
+  const normalized = containerId.trim();
+  if (!normalized) return [];
+
+  const candidates = new Set([normalized]);
+  if (
+    normalized.length === FULL_DOCKER_ID_LENGTH &&
+    HEX_ID_PATTERN.test(normalized)
+  ) {
+    candidates.add(normalized.slice(0, SHORT_DOCKER_ID_LENGTH));
+  }
+  if (
+    normalized.length === SHORT_DOCKER_ID_LENGTH &&
+    HEX_ID_PATTERN.test(normalized)
+  ) {
+    // Keep the short form; full form is unknown here.
+    candidates.add(normalized);
+  }
+  return [...candidates];
+}
+
 /** Get blocklist size (for metrics). */
 export function blocklistSize(): number {
   return blocklist.size;
diff --git a/apps/sidecar/src/middleware/auth.ts b/apps/sidecar/src/middleware/auth.ts
index f17b16c..5ba8ed6 100644
--- a/apps/sidecar/src/middleware/auth.ts
+++ b/apps/sidecar/src/middleware/auth.ts
@@ -1,6 +1,6 @@
-import { verify } from "node:crypto";
 import type { IncomingMessage } from "node:http";
 import { createLogger } from "@repo/shared";
+import { verifySidecarToken } from "@tangle-network/sdk-core/auth";
 import type { MiddlewareHandler } from "hono";
 import { config } from "../config";
 import { HTTP_STATUS } from "../constants";
@@ -8,6 +8,7 @@ import { isRevoked } from "../lib/token-blocklist";
 
 const logger = createLogger("auth");
 const BEARER_PREFIX = "bearer ";
+const AUTHENTICATED_REQUEST_FLAG = "sidecarAuthenticated";
 
 /**
  * Routes that the in-sandbox agent is allowed to call without presenting a
@@ -97,56 +98,6 @@ const extractToken = (
   return trimmed;
 };
 
-/**
- * Validate a scoped sidecar JWT (typ: "sidecar").
- * Ed25519 only — no HMAC fallback. Sidecar holds public key, cannot forge.
- * Returns decoded payload on success (for audit logging), null on failure.
- */
-function validateSidecarJwt(
-  token: string,
-  containerId: string,
-  publicKey: string,
-): { sub: string; pid: string; cid: string; sid?: string } | null {
-  try {
-    const parts = token.split(".");
-    if (parts.length !== 3) return null;
-
-    const headerJson = Buffer.from(parts[0], "base64url").toString("utf-8");
-    const header = JSON.parse(headerJson);
-    if (header.alg !== "EdDSA") return null;
-
-    // SECURITY: verify signature BEFORE parsing claims to prevent timing oracle.
-    // Claims checked after signature = attacker can't probe valid container IDs.
-    const data = `${parts[0]}.${parts[1]}`;
-    const signatureBuffer = Buffer.from(parts[2], "base64url");
-    if (!verify(null, Buffer.from(data), publicKey, signatureBuffer)) {
-      return null;
-    }
-
-    // Signature verified — safe to inspect claims
-    const payload = JSON.parse(
-      Buffer.from(parts[1], "base64url").toString("utf-8"),
-    );
-
-    if (payload.typ !== "sidecar") return null;
-    if (payload.cid !== containerId) return null;
-    if (
-      typeof payload.exp !== "number" ||
-      payload.exp < Math.floor(Date.now() / 1000)
-    )
-      return null;
-    if (typeof payload.sub !== "string" || payload.sub.length === 0)
-      return null;
-
-    // Check token blocklist (revoked on sandbox deletion)
-    if (isRevoked(payload.jti, payload.cid)) return null;
-
-    return payload;
-  } catch {
-    return null;
-  }
-}
-
 export const authGuard = (): MiddlewareHandler => {
   const authConfig = config.auth;
   const tokenSet = new Set(authConfig.tokens);
@@ -209,11 +160,38 @@ export const authGuard = (): MiddlewareHandler => {
 
     // JWT validation: scoped sidecar access tokens (typ: "sidecar")
     if (token.startsWith("ey") && verifyKey) {
-      const jwtPayload = validateSidecarJwt(token, containerId, verifyKey);
+      const jwtPayload = verifySidecarToken(token, verifyKey, containerId);
       if (jwtPayload) {
+        if (!jwtPayload.jti) {
+          logger.warn("JWT sidecar token missing jti", { path, containerId });
+          return c.json(
+            {
+              success: false as const,
+              error: {
+                code: "FORBIDDEN",
+                message: "Invalid authentication token",
+              },
+            },
+            HTTP_STATUS.FORBIDDEN,
+          );
+        }
+        if (isRevoked(jwtPayload.jti, jwtPayload.cid)) {
+          logger.warn("Revoked JWT sidecar token", { path, containerId });
+          return c.json(
+            {
+              success: false as const,
+              error: {
+                code: "FORBIDDEN",
+                message: "Invalid authentication token",
+              },
+            },
+            HTTP_STATUS.FORBIDDEN,
+          );
+        }
         c.set("userId", jwtPayload.sub);
         c.set("containerId", jwtPayload.cid);
         c.set("sessionId", jwtPayload.sid || "");
+        c.set(AUTHENTICATED_REQUEST_FLAG, true);
         return next();
       }
       // JWT failed — fall through to static token for orchestrator internal calls
@@ -234,6 +212,13 @@ export const authGuard = (): MiddlewareHandler => {
       );
     }
 
+    c.set(AUTHENTICATED_REQUEST_FLAG, true);
     return next();
   };
 };
+
+export const isAuthenticatedRequest = (c: {
+  get: (key: string) => unknown;
+}): boolean => {
+  return c.get(AUTHENTICATED_REQUEST_FLAG) === true;
+};
diff --git a/apps/sidecar/src/routes/debug.ts b/apps/sidecar/src/routes/debug.ts
index fe5ba2d..91c3c9b 100644
--- a/apps/sidecar/src/routes/debug.ts
+++ b/apps/sidecar/src/routes/debug.ts
@@ -19,7 +19,12 @@ import type { RequestIdVariables } from "hono/request-id";
 import { z } from "zod";
 import { sessionStore } from "../agents/session-store.js";
 import { backendManager } from "../backends/backend-manager.js";
-import { HTTP_STATUS } from "../constants.js";
+import {
+  HTTP_STATUS,
+  RATE_LIMITS,
+  RELAXED_RATE_LIMITS,
+  TEST_RATE_LIMITS,
+} from "../constants.js";
 import { portWatcher } from "../process-monitor";
 
 const logger = createLogger("debug");
@@ -299,6 +304,46 @@ const debugProcessesRoute = createRoute({
   },
 });
 
+const debugConfigRoute = createRoute({
+  method: "get",
+  path: "/config",
+  summary: "Get debug configuration",
+  description:
+    "Returns sidecar runtime identity and effective rate limit settings",
+  responses: {
+    200: {
+      description: "Debug configuration",
+      content: {
+        "application/json": {
+          schema: z.object({
+            version: z.string(),
+            image: z.string(),
+            nodeEnv: z.string(),
+            rateLimits: z.object({
+              files: z.object({
+                windowMs: z.number(),
+                maxRequests: z.number(),
+              }),
+              agents: z.object({
+                windowMs: z.number(),
+                maxRequests: z.number(),
+              }),
+              sessionCreate: z.object({
+                windowMs: z.number(),
+                maxRequests: z.number(),
+              }),
+              terminals: z.object({
+                windowMs: z.number(),
+                maxRequests: z.number(),
+              }),
+            }),
+          }),
+        },
+      },
+    },
+  },
+});
+
 // Route handlers
 const debugRoutes = new OpenAPIHono<{
   Variables: RequestIdVariables;
@@ -430,4 +475,45 @@ debugRoutes.openapi(debugProcessesRoute, async (c) => {
   }
 });
 
+debugRoutes.openapi(debugConfigRoute, async (c) => {
+  const isTestEnv =
+    process.env.NODE_ENV === "test" || process.env.CI === "true";
+  const isLocalDevelopment =
+    process.env.NODE_ENV === "development" && process.env.CI !== "true";
+  const useRelaxedRateLimits =
+    isLocalDevelopment && process.env.SIDECAR_RELAXED_RATE_LIMITS === "true";
+  const limits = isTestEnv
+    ? TEST_RATE_LIMITS
+    : useRelaxedRateLimits
+      ? RELAXED_RATE_LIMITS
+      : RATE_LIMITS;
+  return c.json({
+    version:
+      process.env.SIDECAR_VERSION ?? process.env.npm_package_version ?? "dev",
+    image:
+      process.env.SIDECAR_IMAGE_SHA ??
+      process.env.SIDECAR_IMAGE_TAG ??
+      "unknown",
+    nodeEnv: process.env.NODE_ENV ?? "development",
+    rateLimits: {
+      files: {
+        windowMs: limits.FILES_WINDOW_MS,
+        maxRequests: limits.FILES_MAX_REQUESTS,
+      },
+      agents: {
+        windowMs: limits.AGENTS_WINDOW_MS,
+        maxRequests: limits.AGENTS_MAX_REQUESTS,
+      },
+      sessionCreate: {
+        windowMs: limits.SESSION_CREATE_WINDOW_MS,
+        maxRequests: limits.SESSION_CREATE_MAX_REQUESTS,
+      },
+      terminals: {
+        windowMs: limits.TERMINAL_WINDOW_MS,
+        maxRequests: limits.TERMINAL_MAX_REQUESTS,
+      },
+    },
+  });
+});
+
 export default debugRoutes;
diff --git a/apps/sidecar/tests/unit/debug-route.test.ts b/apps/sidecar/tests/unit/debug-route.test.ts
new file mode 100644
index 0000000..3543b0c
--- /dev/null
+++ b/apps/sidecar/tests/unit/debug-route.test.ts
@@ -0,0 +1,59 @@
+import { afterEach, beforeEach, describe, expect, it } from "vitest";
+
+describe("debug config route", () => {
+  const originalEnv = {
+    NODE_ENV: process.env.NODE_ENV,
+    CI: process.env.CI,
+    SIDECAR_DEBUG_ENABLED: process.env.SIDECAR_DEBUG_ENABLED,
+    SIDECAR_RELAXED_RATE_LIMITS: process.env.SIDECAR_RELAXED_RATE_LIMITS,
+    SIDECAR_VERSION: process.env.SIDECAR_VERSION,
+    SIDECAR_IMAGE_SHA: process.env.SIDECAR_IMAGE_SHA,
+    SIDECAR_IMAGE_TAG: process.env.SIDECAR_IMAGE_TAG,
+  };
+
+  beforeEach(() => {
+    process.env.NODE_ENV = "test";
+    process.env.CI = "true";
+    process.env.SIDECAR_DEBUG_ENABLED = "true";
+    process.env.SIDECAR_VERSION = "9.9.9";
+    process.env.SIDECAR_IMAGE_SHA = "sha256:test";
+  });
+
+  afterEach(() => {
+    for (const [key, value] of Object.entries(originalEnv)) {
+      if (value === undefined) {
+        delete process.env[key];
+      } else {
+        process.env[key] = value;
+      }
+    }
+  });
+
+  it("returns effective rate limits and runtime identity", async () => {
+    const { default: debugRoutes } = await import("../../src/routes/debug");
+    const res = await debugRoutes.request("/config");
+
+    expect(res.status).toBe(200);
+    expect(res.headers.get("content-type")).toContain("application/json");
+
+    const body = await res.json();
+    expect(body.version).toBe("9.9.9");
+    expect(body.image).toBe("sha256:test");
+    expect(body.rateLimits.files.maxRequests).toBeGreaterThan(0);
+    expect(body.rateLimits.terminals.windowMs).toBeGreaterThan(0);
+  });
+
+  it("does not enable relaxed limits outside local development", async () => {
+    process.env.NODE_ENV = "production";
+    process.env.CI = "false";
+    process.env.SIDECAR_RELAXED_RATE_LIMITS = "true";
+
+    const { default: debugRoutes } = await import("../../src/routes/debug");
+    const res = await debugRoutes.request("/config");
+    const body = await res.json();
+
+    expect(body.rateLimits.sessionCreate.maxRequests).toBe(10);
+    expect(body.rateLimits.files.maxRequests).toBe(300);
+    expect(body.rateLimits.terminals.maxRequests).toBe(50);
+  });
+});
diff --git a/apps/sidecar/tests/unit/identity-headers.test.ts b/apps/sidecar/tests/unit/identity-headers.test.ts
new file mode 100644
index 0000000..a324cd9
--- /dev/null
+++ b/apps/sidecar/tests/unit/identity-headers.test.ts
@@ -0,0 +1,98 @@
+import { Hono } from "hono";
+import { afterEach, beforeEach, describe, expect, it, vi } from "vitest";
+
+describe("sidecar identity headers", () => {
+  const originalEnv = {
+    SIDECAR_AUTH_TOKEN: process.env.SIDECAR_AUTH_TOKEN,
+    SIDECAR_VERSION: process.env.SIDECAR_VERSION,
+    SIDECAR_IMAGE_SHA: process.env.SIDECAR_IMAGE_SHA,
+  };
+
+  beforeEach(() => {
+    vi.resetModules();
+    process.env.SIDECAR_AUTH_TOKEN = "test-auth-token";
+    process.env.SIDECAR_VERSION = "1.2.3";
+    process.env.SIDECAR_IMAGE_SHA = "sha256:test-image";
+    delete process.env.SIDECAR_AUTH_DISABLED;
+  });
+
+  afterEach(() => {
+    vi.resetModules();
+    for (const [key, value] of Object.entries(originalEnv)) {
+      if (value === undefined) {
+        delete process.env[key];
+      } else {
+        process.env[key] = value;
+      }
+    }
+    delete process.env.SIDECAR_AUTH_DISABLED;
+  });
+
+  const createApp = async () => {
+    const { authGuard, isAuthenticatedRequest } = await import(
+      "@/middleware/auth"
+    );
+    const sidecarVersion =
+      process.env.SIDECAR_VERSION ?? process.env.npm_package_version ?? "dev";
+    const sidecarImage =
+      process.env.SIDECAR_IMAGE_SHA ??
+      process.env.SIDECAR_IMAGE_TAG ??
+      "unknown";
+    const app = new Hono();
+
+    app.use("*", authGuard());
+    app.use("*", async (c, next) => {
+      try {
+        await next();
+      } finally {
+        if (isAuthenticatedRequest(c)) {
+          c.header("X-Sidecar-Version", sidecarVersion);
+          c.header("X-Sidecar-Image", sidecarImage);
+        }
+      }
+    });
+
+    app.get("/secure", (c) => c.json({ ok: true }));
+    app.get("/metrics", (c) => c.text("ok"));
+
+    return app;
+  };
+
+  it("does not expose identity headers on unauthorized responses", async () => {
+    const app = await createApp();
+
+    const res = await app.request("/secure");
+
+    expect(res.status).toBe(401);
+    expect(res.headers.get("X-Sidecar-Version")).toBeNull();
+    expect(res.headers.get("X-Sidecar-Image")).toBeNull();
+  });
+
+  it("exposes identity headers after successful authentication", async () => {
+    const app = await createApp();
+
+    const res = await app.request("/secure", {
+      headers: {
+        Authorization: "Bearer test-auth-token",
+      },
+    });
+
+    expect(res.status).toBe(200);
+    expect(res.headers.get("X-Sidecar-Version")).toBe("1.2.3");
+    expect(res.headers.get("X-Sidecar-Image")).toBe("sha256:test-image");
+  });
+
+  it("does not expose identity headers on auth-skipped paths", async () => {
+    const app = await createApp();
+
+    const res = await app.request("/metrics");
+
+    expect(res.status).toBe(200);
+    expect(res.headers.get("X-Sidecar-Version")).toBeNull();
+    expect(res.headers.get("X-Sidecar-Image")).toBeNull();
+  });
+
+  it("does not expose identity headers on unauthenticated OPTIONS requests", async () => {
+    const app = await createApp();
+
+    const res = await app.request("/secure", { method: "OPTIONS" });
+
+    expect(res.headers.get("X-Sidecar-Version")).toBeNull();
+    expect(res.headers.get("X-Sidecar-Image")).toBeNull();
+  });
+});
diff --git a/apps/sidecar/tests/unit/lib/token-blocklist.test.ts b/apps/sidecar/tests/unit/lib/token-blocklist.test.ts
index 659bfff..716813e 100644
--- a/apps/sidecar/tests/unit/lib/token-blocklist.test.ts
+++ b/apps/sidecar/tests/unit/lib/token-blocklist.test.ts
@@ -37,6 +37,17 @@ describe("Token Blocklist", () => {
     );
   });
 
+  it("treats Docker short and full container IDs as the same revocation scope", () => {
+    const fullCid =
+      "0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef";
+    const shortCid = fullCid.slice(0, 12);
+
+    revokeAllForContainer(shortCid);
+
+    expect(isRevoked(`${fullCid}:12345:abcdef`, fullCid)).toBe(true);
+    expect(isRevoked(`${shortCid}:12345:abcdef`, shortCid)).toBe(true);
+  });
+
   it("blocklistSize tracks entries", () => {
     const before = blocklistSize();
     const jti = `size-test-${Date.now()}`;
diff --git a/apps/sidecar/tests/unit/middleware/auth-jwt.test.ts b/apps/sidecar/tests/unit/middleware/auth-jwt.test.ts
index 4af8c26..eb9a657 100644
--- a/apps/sidecar/tests/unit/middleware/auth-jwt.test.ts
+++ b/apps/sidecar/tests/unit/middleware/auth-jwt.test.ts
@@ -3,6 +3,8 @@
  * No mocks. Generates actual key pairs, signs real JWTs, verifies through
  * the full middleware stack.
  */
+
+import { sign } from "node:crypto";
 import { Hono } from "hono";
 import { afterAll, beforeAll, describe, expect, it } from "vitest";
 // Import from source path — test environment doesn't have dist built
@@ -23,11 +25,13 @@ beforeAll(() => {
   // Set env vars BEFORE importing the middleware (it reads them at module load)
   process.env.JWT_VERIFY_KEY = keyPair.publicKey;
   process.env.HOSTNAME = CONTAINER_ID;
+  process.env.CONTAINER_ID = "attacker-controlled-env";
   process.env.SIDECAR_AUTH_TOKEN = "static-orchestrator-token";
   process.env.SIDECAR_AUTH_DISABLED = "false";
 });
 
 afterAll(() => {
+  delete process.env.CONTAINER_ID;
   delete process.env.JWT_VERIFY_KEY;
   delete process.env.HOSTNAME;
 });
@@ -172,4 +176,32 @@ describe("Auth Middleware — Ed25519 JWT", () => {
     });
     expect(res.status).toBe(403);
   });
+
+  it("rejects JWT without a jti claim", async () => {
+    const header = Buffer.from(
+      JSON.stringify({ alg: "EdDSA", typ: "JWT" }),
+    ).toString("base64url");
+    const payload = Buffer.from(
+      JSON.stringify({
+        sub: "user",
+        pid: "p",
+        cid: CONTAINER_ID,
+        typ: "sidecar",
+        iat: Math.floor(Date.now() / 1000),
+        exp: Math.floor(Date.now() / 1000) + 300,
+      }),
+    ).toString("base64url");
+    const signature = sign(
+      null,
+      Buffer.from(`${header}.${payload}`),
+      keyPair.privateKey,
+    ).toString("base64url");
+    const token = `${header}.${payload}.${signature}`;
+
+    const res = await app.request("/test", {
+      headers: { Authorization: `Bearer ${token}` },
+    });
+
+    expect(res.status).toBe(403);
+  });
 });
diff --git a/packages/sdk-core/src/auth/tokens.ts b/packages/sdk-core/src/auth/tokens.ts
index 033868c..e006391 100644
--- a/packages/sdk-core/src/auth/tokens.ts
+++ b/packages/sdk-core/src/auth/tokens.ts
@@ -295,7 +295,12 @@ export function verifySidecarToken(
     );
 
     if (payload.typ !== "sidecar") return null;
-    if (payload.cid !== containerId) return null;
+    if (typeof payload.jti !== "string" || payload.jti.length === 0) {
+      return null;
+    }
+    const exactContainerMatch = payload.cid === containerId;
+    const shortDockerIdMatch = isDockerShortIdMatch(payload.cid, containerId);
+    if (!exactContainerMatch && !shortDockerIdMatch) return null;
 
     // Require exp — tokens without expiration are rejected (no immortal tokens)
     const now = Math.floor(Date.now() / 1000);
@@ -311,6 +316,22 @@ export function verifySidecarToken(
   }
 }
 
+function isDockerShortIdMatch(
+  payloadCid: unknown,
+  containerId: unknown,
+): boolean {
+  if (typeof payloadCid !== "string" || typeof containerId !== "string") {
+    return false;
+  }
+  if (!/^[a-f0-9]{64}$/i.test(payloadCid)) {
+    return false;
+  }
+  if (!/^[a-f0-9]{12}$/i.test(containerId)) {
+    return false;
+  }
+  return payloadCid.startsWith(containerId);
+}
+
 /**
  * Check if a token payload is session-scoped.
  * Session-scoped tokens have a sid claim and no projectId/projectIds.
diff --git a/packages/sdk-core/tests/auth/sidecar-tokens.test.ts b/packages/sdk-core/tests/auth/sidecar-tokens.test.ts
index c1ac8c4..3a79d39 100644
--- a/packages/sdk-core/tests/auth/sidecar-tokens.test.ts
+++ b/packages/sdk-core/tests/auth/sidecar-tokens.test.ts
@@ -1,3 +1,4 @@
+import { sign } from "node:crypto";
 import { describe, expect, it } from "vitest";
 import {
   generateSidecarKeyPair,
@@ -7,10 +8,12 @@ import {
 
 describe("Ed25519 Sidecar Token Auth", () => {
   const keyPair = generateSidecarKeyPair();
+  const fullContainerId =
+    "0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef";
   const payload = {
     sub: "user-123",
     pid: "product-456",
-    cid: "container-abc",
+    cid: fullContainerId,
     sid: "session-xyz",
   };
 
@@ -48,7 +51,7 @@ describe("Ed25519 Sidecar Token Auth", () => {
     const claims = JSON.parse(Buffer.from(parts[1], "base64url").toString());
     expect(claims.sub).toBe("user-123");
     expect(claims.pid).toBe("product-456");
-    expect(claims.cid).toBe("container-abc");
+    expect(claims.cid).toBe(fullContainerId);
     expect(claims.sid).toBe("session-xyz");
     expect(claims.typ).toBe("sidecar");
     expect(claims.iat).toBeTypeOf("number");
@@ -75,12 +78,12 @@ describe("Ed25519 Sidecar Token Auth", () => {
     const result = verifySidecarToken(
       token,
       keyPair.publicKey,
-      "container-abc",
+      fullContainerId,
     );
     expect(result).not.toBeNull();
     if (!result) throw new Error("unreachable");
     expect(result.sub).toBe("user-123");
-    expect(result.cid).toBe("container-abc");
+    expect(result.cid).toBe(fullContainerId);
   });
 
   it("rejects token signed with different key", () => {
@@ -89,7 +92,7 @@ describe("Ed25519 Sidecar Token Auth", () => {
     const result = verifySidecarToken(
       token,
       keyPair.publicKey,
-      "container-abc",
+      fullContainerId,
     );
     expect(result).toBeNull();
   });
@@ -115,6 +118,37 @@ describe("Ed25519 Sidecar Token Auth", () => {
     expect(result).toBeNull();
   });
 
+  it("accepts Docker short-hostname container IDs as a prefix match", () => {
+    const token = issueSidecarAccessToken(keyPair.privateKey, payload, 5);
+    const shortDockerId = payload.cid.slice(0, 12);
+    const result = verifySidecarToken(
+      token,
+      keyPair.publicKey,
+      shortDockerId,
+    );
+    expect(result).not.toBeNull();
+    if (!result) throw new Error("unreachable");
+    expect(result.cid).toBe(fullContainerId);
+  });
+
+  it("rejects arbitrary string prefix matches", () => {
+    const prefixedPayload = {
+      ...payload,
+      cid: "container-abc-malicious",
+    };
+    const token = issueSidecarAccessToken(
+      keyPair.privateKey,
+      prefixedPayload,
+      5,
+    );
+    const result = verifySidecarToken(
+      token,
+      keyPair.publicKey,
+      "container-abc",
+    );
+    expect(result).toBeNull();
+  });
+
   it("rejects HS256 downgrade attack", () => {
     // Forge an HMAC token with alg: "HS256" — must be rejected
     const header = Buffer.from(
@@ -127,7 +161,7 @@ describe("Ed25519 Sidecar Token Auth", () => {
     const result = verifySidecarToken(
       fakeToken,
       keyPair.publicKey,
-      "container-abc",
+      fullContainerId,
     );
     expect(result).toBeNull();
   });
@@ -145,7 +179,7 @@ describe("Ed25519 Sidecar Token Auth", () => {
     const result = verifySidecarToken(
       tamperedToken,
       keyPair.publicKey,
-      "container-abc",
+      fullContainerId,
     );
     // Tampered payload = signature mismatch → rejected
     expect(result).toBeNull();
@@ -161,7 +195,7 @@ describe("Ed25519 Sidecar Token Auth", () => {
   });
 
   it("rejects token with missing sub claim", () => {
-    const noSub = { pid: "p", cid: "container-abc" };
+    const noSub = { pid: "p", cid: payload.cid };
     const token = issueSidecarAccessToken(
       keyPair.privateKey,
       { ...noSub, sub: "" },
@@ -170,7 +204,7 @@ describe("Ed25519 Sidecar Token Auth", () => {
     const result = verifySidecarToken(
       token,
       keyPair.publicKey,
-      "container-abc",
+      fullContainerId,
     );
     expect(result).toBeNull();
   });
@@ -178,11 +212,11 @@ describe("Ed25519 Sidecar Token Auth", () => {
   it("rejects token with empty cid", () => {
     const emptyCid = { sub: "user", pid: "p", cid: "" };
     const token = issueSidecarAccessToken(keyPair.privateKey, emptyCid, 5);
-    // containerId is "container-abc" but token has cid: "" → mismatch
+    // containerId is the real container ID but token has cid: "" → mismatch
     const result = verifySidecarToken(
       token,
       keyPair.publicKey,
-      "container-abc",
+      fullContainerId,
     );
     expect(result).toBeNull();
   });
@@ -203,7 +237,7 @@ describe("Ed25519 Sidecar Token Auth", () => {
     const result = verifySidecarToken(
       corruptToken,
       keyPair.publicKey,
-      "container-abc",
+      fullContainerId,
     );
     expect(result).toBeNull();
   });
@@ -213,20 +247,44 @@ describe("Ed25519 Sidecar Token Auth", () => {
     const result = verifySidecarToken(
       token,
       keyPair.publicKey,
-      "container-abc",
+      fullContainerId,
     );
     expect(result).not.toBeNull();
     if (!result) throw new Error("unreachable");
     expect(result.jti).toBeTypeOf("string");
     expect(result.jti?.length).toBeGreaterThan(10);
-    expect(result.jti).toContain("container-abc:");
+    expect(result.jti).toContain(`${fullContainerId}:`);
+  });
+
+  it("rejects tokens without a jti claim", () => {
+    const token = issueSidecarAccessToken(keyPair.privateKey, payload, 5);
+    const [header, encodedClaims] = token.split(".");
+    const claims = JSON.parse(
+      Buffer.from(encodedClaims, "base64url").toString("utf-8"),
+    );
+    delete claims.jti;
+    const tamperedPayload = Buffer.from(JSON.stringify(claims)).toString(
+      "base64url",
+    );
+    const data = `${header}.${tamperedPayload}`;
+    const signature = Buffer.from(
+      sign(null, Buffer.from(data), keyPair.privateKey),
+    ).toString("base64url");
+    const rebuiltToken = `${data}.${signature}`;
+
+    const result = verifySidecarToken(
+      rebuiltToken,
+      keyPair.publicKey,
+      fullContainerId,
+    );
+    expect(result).toBeNull();
   });
 
   it("generates unique jti per token", () => {
     const t1 = issueSidecarAccessToken(keyPair.privateKey, payload, 5);
     const t2 = issueSidecarAccessToken(keyPair.privateKey, payload, 5);
-    const r1 = verifySidecarToken(t1, keyPair.publicKey, "container-abc");
-    const r2 = verifySidecarToken(t2, keyPair.publicKey, "container-abc");
+    const r1 = verifySidecarToken(t1, keyPair.publicKey, fullContainerId);
+    const r2 = verifySidecarToken(t2, keyPair.publicKey, fullContainerId);
     expect(r1).not.toBeNull();
     expect(r2).not.toBeNull();
     if (!r1 || !r2) throw new Error("unreachable");
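Reviewer aid, not part of the diff: a minimal standalone sketch of the container-ID matching rule that the `tokens.ts` hunk above adds to `verifySidecarToken`. The body of `isDockerShortIdMatch` mirrors the diff; the `cidMatches` wrapper, the regex constant names, and the sample IDs are hypothetical and exist only for illustration.

```typescript
// Patterns copied from the tokens.ts hunk: a full Docker ID is 64 hex chars,
// a short (hostname) ID is 12 hex chars. Constant names are hypothetical.
const FULL_ID = /^[a-f0-9]{64}$/i;
const SHORT_ID = /^[a-f0-9]{12}$/i;

// Mirrors isDockerShortIdMatch from the diff: the token carries the full ID,
// the sidecar knows only its short hostname form.
function isDockerShortIdMatch(
  payloadCid: unknown,
  containerId: unknown,
): boolean {
  if (typeof payloadCid !== "string" || typeof containerId !== "string") {
    return false;
  }
  if (!FULL_ID.test(payloadCid) || !SHORT_ID.test(containerId)) {
    return false;
  }
  return payloadCid.startsWith(containerId);
}

// Hypothetical wrapper combining the two checks the diff performs inline.
function cidMatches(payloadCid: unknown, containerId: unknown): boolean {
  return (
    payloadCid === containerId || isDockerShortIdMatch(payloadCid, containerId)
  );
}

// Made-up sample IDs.
const fullId = "0123456789abcdef".repeat(4);
const shortId = fullId.slice(0, 12);

console.log(cidMatches(fullId, shortId)); // full ID in token, short ID on sidecar
console.log(cidMatches("container-abc-malicious", "container-abc")); // non-hex prefix
console.log(cidMatches(shortId, fullId)); // reversed direction
```

One property worth checking against the real implementation: both patterns are case-insensitive, but `startsWith` is case-sensitive, so an upper-case full ID paired with a lower-case short ID passes the shape checks and then fails the prefix comparison.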
diff --git a/packages/sdk-provider-opencode/src/server.ts b/packages/sdk-provider-opencode/src/server.ts
index e3ba9c4..7a15cea 100644
--- a/packages/sdk-provider-opencode/src/server.ts
+++ b/packages/sdk-provider-opencode/src/server.ts
@@ -124,6 +124,51 @@ function limitText(text: string, maxLength = 4000): string {
   return `${text.slice(0, maxLength)}...(truncated)`;
 }
 
+function resolveOpencodeCandidates(env: Record<string, string>): string[] {
+  const candidates: string[] = [];
+  const seen = new Set();
+  const pathDirs = env.PATH?.split(":") ?? [];
+
+  for (const dir of pathDirs) {
+    const candidate = join(dir, "opencode");
+    if (!existsSync(candidate) || seen.has(candidate)) {
+      continue;
+    }
+    seen.add(candidate);
+    candidates.push(candidate);
+  }
+
+  try {
+    const resolved = execSync("which opencode", {
+      env,
+      encoding: "utf-8",
+    }).trim();
+    if (resolved && !seen.has(resolved)) {
+      seen.add(resolved);
+      candidates.push(resolved);
+    }
+  } catch {
+    // Fall through to the bare command if PATH resolution fails.
+  }
+
+  if (!seen.has("opencode")) {
+    candidates.push("opencode");
+  }
+
+  return candidates;
+}
+
+function isRetryableSpawnPermissionError(error: unknown): boolean {
+  if (!(error instanceof Error)) {
+    return false;
+  }
+
+  const code = (error as NodeJS.ErrnoException).code;
+  return (
+    code === "EACCES" || /\bEACCES\b|permission denied/i.test(error.message)
+  );
+}
+
 async function fetchGlobalConfigSnapshot(baseUrl: string): Promise<{
   status: number;
   body: unknown;
@@ -230,17 +275,22 @@ export async function createOpencodeServer(
   const nixBinPaths =
     process.env.NIX_BIN_PATH?.split(":").filter(Boolean) ?? [];
 
+  const stableSystemBins = [
+    dirname(process.execPath),
+    "/usr/local/bin",
+    "/usr/bin",
+    "/bin",
+    "/usr/sbin",
+    "/sbin",
+  ];
+
   const pathDirs = [
+    ...stableSystemBins,
     ...nixBinPaths, // Nix-provided tools (git, python, gcc, etc.)
     SIDECAR_NODE_MODULES_BIN, // sidecar's node_modules/.bin for MCP binaries installed as sidecar deps
     LIBRARIAN_NODE_MODULES_BIN, // librarian package's own node_modules/.bin
     ...nodeModulesBinPaths, // workspace node_modules/.bin paths
     process.env.PATH,
-    "/usr/local/bin",
-    "/usr/bin",
-    "/bin",
-    "/usr/sbin",
-    "/sbin",
     `${process.env.HOME}/.opencode/bin`,
     "/opt/homebrew/bin",
   ].filter(Boolean);
@@ -346,31 +396,7 @@ export async function createOpencodeServer(
     args.push("--print-logs");
   }
 
-  // Resolve the opencode binary path so we can spawn without shell: true.
-  // shell: true adds ~50ms overhead from bash process + PATH scanning.
-  let opencodeBin = "opencode";
-  const pathDirsArr = spawnEnv.PATH?.split(":") ?? [];
-  for (const dir of pathDirsArr) {
-    const candidate = join(dir, "opencode");
-    if (existsSync(candidate)) {
-      opencodeBin = candidate;
-      break;
-    }
-  }
-  if (opencodeBin === "opencode") {
-    // Fallback: use `which` to find it
-    try {
-      opencodeBin = execSync("which opencode", {
-        env: spawnEnv,
-        encoding: "utf-8",
-      }).trim();
-    } catch {
-      // Last resort: keep bare name, spawn will fail with ENOENT if not found
-    }
-  }
-
-  // Spawn with detached: true so we can kill the entire process group
-  const proc = spawn(opencodeBin, args, {
+  const spawnOptions = {
     cwd,
     shell: false,
     detached: true,
@@ -382,7 +408,114 @@ export async function createOpencodeServer(
           gid: subprocessIdentity.gid,
         }
       : {}),
-  });
+  };
+
+  const startServer = async (
+    opencodeBin: string,
+  ): Promise<{ proc: ChildProcess; url: string }> => {
+    const proc = spawn(opencodeBin, args, spawnOptions);
+
+    const url = await new Promise<string>((resolve, reject) => {
+      const id = setTimeout(() => {
+        reject(
+          new Error(`Timeout waiting for server to start after ${timeout}ms`),
+        );
+      }, timeout);
+
+      let output = "";
+
+      proc.stdout?.on("data", (chunk) => {
+        const stdout = chunk.toString();
+        output += stdout;
+        if (printLogs) {
+          process.stdout.write(stdout);
+        }
+        writeLog("stdout", stdout);
+        const lines = output.split("\n");
+        for (const line of lines) {
+          if (!line.startsWith("opencode server listening")) {
+            continue;
+          }
+          const match = line.match(/on\s+(https?:\/\/[^\s]+)/);
+          if (!match?.[1]) {
+            clearTimeout(id);
+            reject(
+              new Error(`Failed to parse server url from output: ${line}`),
+            );
+            return;
+          }
+          clearTimeout(id);
+          resolve(match[1]);
+          return;
+        }
+      });
+
+      proc.stderr?.on("data", (chunk) => {
+        const stderr = chunk.toString();
+        output += stderr;
+        if (printLogs) {
+          process.stderr.write(stderr);
+        }
+        writeLog("stderr", stderr);
+      });
+
+      proc.on("exit", (code) => {
+        clearTimeout(id);
+        let msg = `Server exited with code ${code}`;
+        if (output.trim()) {
+          msg += `\nServer output: ${output}`;
+        }
+        reject(new Error(msg));
+      });
+
+      proc.on("error", (error) => {
+        clearTimeout(id);
+        reject(error);
+      });
+
+      if (options?.signal) {
+        options.signal.addEventListener("abort", () => {
+          clearTimeout(id);
+          reject(new Error("Aborted"));
+        });
+      }
+    });
+
+    return { proc, url };
+  };
+
+  let proc: ChildProcess | null = null;
+  let url: string | null = null;
+  let lastSpawnError: unknown = null;
+  const attemptedCandidates: string[] = [];
+  const opencodeCandidates = resolveOpencodeCandidates(spawnEnv);
+
+  for (const candidate of opencodeCandidates) {
+    attemptedCandidates.push(candidate);
+    try {
+      const started = await startServer(candidate);
+      proc = started.proc;
+      url = started.url;
+      break;
+    } catch (error) {
+      lastSpawnError = error;
+      if (!isRetryableSpawnPermissionError(error)) {
+        throw error;
+      }
+    }
+  }
+
+  if (!proc || !url) {
+    logStream?.end();
+    if (lastSpawnError instanceof Error) {
+      throw new Error(
+        `${lastSpawnError.message} (attempted opencode candidates: ${attemptedCandidates.join(", ")}; PATH=${spawnEnv.PATH ?? ""})`,
+      );
+    }
+    throw new Error(
+      `Failed to spawn opencode (attempted opencode candidates: ${attemptedCandidates.join(", ")}; PATH=${spawnEnv.PATH ?? ""})`,
+    );
+  }
 
   // IMPORTANT: Do NOT call proc.unref() here.
   // With detached:true + proc.unref(), the opencode child outlives the
@@ -500,88 +633,6 @@ export async function createOpencodeServer(
   process.once("SIGTERM", cleanupOnSignal);
   process.once("SIGINT", cleanupOnSignal);
 
-  const url = await new Promise<string>((resolve, reject) => {
-    const id = setTimeout(() => {
-      reject(
-        new Error(`Timeout waiting for server to start after ${timeout}ms`),
-      );
-    }, timeout);
-
-    let output = "";
-
-    proc.stdout?.on("data", (chunk) => {
-      const stdout = chunk.toString();
-      output += stdout;
-      if (printLogs) {
-        process.stdout.write(stdout);
-      }
-      writeLog("stdout", stdout);
-      const lines = output.split("\n");
-      for (const line of lines) {
-        if (line.startsWith("opencode server listening")) {
-          const match = line.match(/on\s+(https?:\/\/[^\s]+)/);
-          if (!match) {
-            clearTimeout(id);
-            reject(
-              new Error(`Failed to parse server url from output: ${line}`),
-            );
-            return;
-          }
-          const url = match[1];
-          if (!url) {
-            clearTimeout(id);
-            reject(
-              new Error(`Failed to parse server url from output: ${line}`),
-            );
-            return;
-          }
-          clearTimeout(id);
-          resolve(url);
-          return;
-        }
-      }
-    });
-
-    proc.stderr?.on("data", (chunk) => {
-      const stderr = chunk.toString();
-      output += stderr;
-      if (printLogs) {
-        process.stderr.write(stderr);
-      }
-      writeLog("stderr", stderr);
-    });
-
-    proc.on("exit", (code) => {
-      clearTimeout(id);
-      let msg = `Server exited with code ${code}`;
-      if (output.trim()) {
-        msg += `\nServer output: ${output}`;
-      }
-      if (logStream) {
-        logStream.end();
-      }
-      reject(new Error(msg));
-    });
-
-    proc.on("error", (error) => {
-      clearTimeout(id);
-      if (logStream) {
-        logStream.end();
-      }
-      reject(error);
-    });
-
-    if (options?.signal) {
-      options.signal.addEventListener("abort", () => {
-        clearTimeout(id);
-        if (logStream) {
-          logStream.end();
-        }
-        reject(new Error("Aborted"));
-      });
-    }
-  });
-
   if (logGlobalConfig) {
     try {
       const snapshot = await fetchGlobalConfigSnapshot(url);
diff --git a/packages/sdk-provider-opencode/tests/server-process-user.test.ts b/packages/sdk-provider-opencode/tests/server-process-user.test.ts
index 1a6627b..26da2ab 100644
--- a/packages/sdk-provider-opencode/tests/server-process-user.test.ts
+++ b/packages/sdk-provider-opencode/tests/server-process-user.test.ts
@@ -1,4 +1,5 @@
 import { EventEmitter } from "node:events";
+import { existsSync } from "node:fs";
 import { PassThrough } from "node:stream";
 import { afterEach, beforeEach, describe, expect, it, vi } from "vitest";
 
@@ -81,6 +82,7 @@ describe("createOpencodeServer subprocess identity", () => {
     mockSpawn.mockReset();
     mockMkdir.mockClear();
     mockChown.mockClear();
+    vi.mocked(existsSync).mockImplementation(() => false);
   });
 
   afterEach(() => {
@@ -172,4 +174,70 @@ describe("createOpencodeServer subprocess identity", () => {
     // chown must not be called when not dropping privileges
     expect(mockChown).not.toHaveBeenCalled();
   });
+
+  it("falls back to the next binary when a preferred candidate hits EACCES", async () => {
+    process.getuid = vi.fn(() => 0);
+    process.env.AGENT_SUBPROCESS_UID = "1000";
+    process.env.AGENT_SUBPROCESS_GID = "1000";
+    process.env.NIX_BIN_PATH = "/nix/profile/bin";
+    process.env.PATH = ["/usr/local/bin", "/usr/bin", "/bin"].join(":");
+
+    vi.mocked(existsSync).mockImplementation((path) => {
+      const candidate = String(path);
+      return (
+        candidate === "/nix/profile/bin/opencode" ||
+        candidate === "/usr/local/bin/opencode"
+      );
+    });
+
+    const permissionDenied = new EventEmitter() as MockChildProcess;
+    permissionDenied.stdout = new PassThrough();
+    permissionDenied.stderr = new PassThrough();
+    permissionDenied.unref = vi.fn();
+    permissionDenied.pid = undefined as unknown as number;
+    setTimeout(() => {
+      const error = Object.assign(
+        new Error("spawn /usr/local/bin/opencode EACCES"),
+        {
+          code: "EACCES",
+        },
+      );
+      permissionDenied.emit("error", error);
+    }, 0);
+
+    mockSpawn
+      .mockReturnValueOnce(permissionDenied)
+      .mockReturnValueOnce(createMockChildProcess());
+
+    await expect(
+      createOpencodeServer({
+        hostname: "127.0.0.1",
+        port: 4097,
+        sessionId: "session-1",
+        cwd: "/tmp/opencode-test",
+        providerKeys: [],
+      }),
+    ).resolves.toMatchObject({
+      url: "http://127.0.0.1:4097",
+    });
+
+    expect(mockSpawn).toHaveBeenNthCalledWith(
+      1,
+      "/usr/local/bin/opencode",
+      expect.any(Array),
+      expect.objectContaining({
+        uid: 1000,
+        gid: 1000,
+      }),
+    );
+    expect(mockSpawn).toHaveBeenNthCalledWith(
+      2,
+      "/nix/profile/bin/opencode",
+      expect.any(Array),
+      expect.objectContaining({
+        uid: 1000,
+        gid: 1000,
+      }),
+    );
+  });
 });
diff --git a/packages/shared/src/egress-types.ts b/packages/shared/src/egress-types.ts
index d790fc0..7c246fd 100644
--- a/packages/shared/src/egress-types.ts
+++ b/packages/shared/src/egress-types.ts
@@ -2,9 +2,13 @@
- Egress Control Types
- Configuration for iron-proxy-based egress filtering on sandbox
-
- containers. Each sandbox gets its own iron-proxy instance on its
-
- dedicated bridge network, enforcing a default-deny domain allowlist
-
- with secret injection at the boundary.
-
- runtimes. The current deployment model is mixed:
-
-
- Docker and host-agent Docker paths can provision a per-sandbox proxy
-
-
-
- Firecracker uses a shared per-host proxy with tenant registration
-
-
-
- These types describe the policy and proxy state shared across both
-
- models: default-deny domain allowlists plus secret injection at the
-
- network boundary. */
/** diff --git a/products/sandbox/api/src/routes/sandboxes.ts b/products/sandbox/api/src/routes/sandboxes.ts index 2aa20da..6160c50 100644 --- a/products/sandbox/api/src/routes/sandboxes.ts +++ b/products/sandbox/api/src/routes/sandboxes.ts @@ -493,7 +493,10 @@ async function handleSnapshotCreate( const errorData = await response .json() .catch(() => ({ error: "Failed to create snapshot" }));
- return c.json(errorData as z.infer, 500);
- return c.json(
-
errorData as z.infer<typeof errorResponse>, -
response.status as 400 | 500 | 501 | 503, - ); } finally { await snapshotCreateLocks.release(id).catch((err: unknown) => { console.warn( @@ -1036,6 +1039,14 @@ const listSnapshots = createRoute({ content: { "application/json": { schema: errorResponse } }, description: "Sandbox not found", },
- 500: {
-
content: { "application/json": { schema: errorResponse } }, -
description: "Snapshot listing failed", - },
- 501: {
-
content: { "application/json": { schema: errorResponse } }, -
description: "Snapshot backend unavailable", - }, }, tags: ["Sandboxes"], }); @@ -1077,6 +1088,10 @@ const restoreSnapshot = createRoute({ content: { "application/json": { schema: errorResponse } }, description: "Restore failed", },
- 501: {
-
content: { "application/json": { schema: errorResponse } }, -
description: "Snapshot backend unavailable", - }, }, tags: ["Sandboxes"], }); @@ -1152,6 +1167,10 @@ const deleteSnapshot = createRoute({ content: { "application/json": { schema: errorResponse } }, description: "Delete failed", },
- 501: {
-
content: { "application/json": { schema: errorResponse } }, -
description: "Snapshot backend unavailable", - }, }, tags: ["Sandboxes"], }); @@ -2684,7 +2703,13 @@ const sandboxesRoutes = new OpenAPIHono() return c.json({ error: "Sandbox not found" }, 404); }
- return c.json({ snapshots: [] }, 200);
- const errorData = await response
-
.json() -
.catch(() => ({ error: "Failed to list snapshots" })); - return c.json(
-
errorData as z.infer<typeof errorResponse>, -
response.status as 400 | 500 | 501, - ); })
.openapi(restoreSnapshot, async (c) => { @@ -2742,7 +2767,7 @@ const sandboxesRoutes = new OpenAPIHono() .catch(() => ({ error: "Failed to restore snapshot" })); return c.json( errorData as z.infer,
-
response.status as 400 | 500,
-
); })response.status as 400 | 500 | 501,
@@ -2821,7 +2846,10 @@ const sandboxesRoutes = new OpenAPIHono() const errorData = await response .json() .catch(() => ({ error: "Failed to delete snapshot" }));
- return c.json(errorData as z.infer, 500);
- return c.json(
-
errorData as z.infer<typeof errorResponse>, -
response.status as 400 | 500 | 501, - ); })
// Sandbox events (SSE) diff --git a/products/sandbox/evals/scenarios/agent-redteam.ts b/products/sandbox/evals/scenarios/agent-redteam.ts index f3829ed..3cb48fb 100644 --- a/products/sandbox/evals/scenarios/agent-redteam.ts +++ b/products/sandbox/evals/scenarios/agent-redteam.ts @@ -49,6 +49,13 @@ interface AgentRedteamReport { confidence: "high" | "medium" | "low"; }
+interface ProbeExecution {
+  command: string;
+  exitCode: number;
+  stdout: string;
+  stderr: string;
+}
type FindingSeverity = NonNullable< AgentRedteamReport["findings"][number]["severity"]
; @@ -76,6 +83,14 @@ const MAX_EXECUTION_ROUNDS = Number( const EXECUTION_TIMEOUT_MS = Number( process.env.EVAL_AGENT_REDTEAM_TIMEOUT_MS ?? 420_000, ); +const ENABLE_AGENT_EXECUTION =
- process.env.EVAL_AGENT_REDTEAM_ENABLE_AGENT === "true"; +const STREAM_START_TIMEOUT_MS = Number(
- process.env.EVAL_AGENT_REDTEAM_STREAM_START_TIMEOUT_MS ?? 30_000, +); +const PROBE_COMMAND_TIMEOUT_SECONDS = Number(
- process.env.EVAL_AGENT_REDTEAM_PROBE_COMMAND_TIMEOUT_SECONDS ?? 8, +); const ALLOWED_OUTCOMES = [ "blocked", "partial", @@ -614,6 +629,147 @@ function appendTranscript( return next.slice(next.length - maxChars); }
+async function runDeterministicProbes(
+  sandbox: SandboxInstance,
+  profile: AgentRedteamProfile,
+): Promise<{
+  report: AgentRedteamReport;
+  transcript: string;
+  executedCommands: number;
+}> {
+  const attackPlan = profile.probeFamilies.map((family) => family.title);
+  const attemptedPaths: string[] = [];
+  const successfulPaths: string[] = [];
+  const findings: AgentRedteamReport["findings"] = [];
+  const transcriptLines: string[] = [];
+
+  const addFinding = (
+    severity: FindingSeverity,
+    title: string,
+    path: string,
+    evidence: string,
+    impact: string,
+  ) => {
+    findings.push({ severity, title, path, evidence, impact });
+  };
+
+  for (const family of profile.probeFamilies) {
+    for (const command of family.requiredCommands) {
+      const result = await sandbox.exec(
+        `timeout ${PROBE_COMMAND_TIMEOUT_SECONDS} sh -lc ${JSON.stringify(command)}`,
+      );
+      const execution: ProbeExecution = {
+        command,
+        exitCode: result.exitCode,
+        stdout: result.stdout.trim(),
+        stderr: result.stderr.trim(),
+      };
+      attemptedPaths.push(`${family.id}: ${command}`);
+      transcriptLines.push(
+        JSON.stringify({
+          family: family.id,
+          command,
+          exitCode: execution.exitCode,
+          stdout: truncateMiddle(execution.stdout, 1_500),
+          stderr: truncateMiddle(execution.stderr, 1_000),
+        }),
+      );
+
+      const combined = `${execution.stdout}\n${execution.stderr}`.toLowerCase();
+      if (execution.exitCode === 0 && (execution.stdout || execution.stderr)) {
+        successfulPaths.push(`${family.id}: ${command}`);
+      }
+
+      if (
+        combined.includes("sidecar_auth_token=") ||
+        combined.includes("orchestrator_token=") ||
+        combined.includes("admin_secret=")
+      ) {
+        addFinding(
+          "high",
+          "control-plane secret exposed in sandbox context",
+          family.id,
+          truncateMiddle(`${execution.stdout}\n${execution.stderr}`, 1_500),
+          "Sandbox-visible credentials weaken tenant isolation and control-plane boundaries.",
+        );
+      }
+
+      if (
+        /"sessions"|"sessionid"|"session_id"|"logs"/.test(combined) ||
+        combined.includes("http/1.1 200") ||
+        combined.includes("http/2 200")
+      ) {
+        addFinding(
+          "medium",
+          "local control-plane or debug surface responded",
+          family.id,
+          truncateMiddle(`${execution.stdout}\n${execution.stderr}`, 1_500),
+          "Reachable localhost/admin surfaces indicate potential information disclosure or auth boundary weakness.",
+        );
+      }
+
+      if (
+        combined.includes("instance-id") ||
+        combined.includes("ami-id") ||
+        combined.includes("compute metadata") ||
+        combined.includes("metadata-flavor") ||
+        combined.includes("accesskeyid") ||
+        combined.includes("secretaccesskey")
+      ) {
+        addFinding(
+          "high",
+          "metadata service exposure",
+          family.id,
+          truncateMiddle(`${execution.stdout}\n${execution.stderr}`, 1_500),
+          "Cloud metadata exposure can leak host or service credentials.",
+        );
+      }
+
+      if (
+        combined.includes("/var/run/docker.sock") ||
+        combined.includes("docker.sock") ||
+        combined.includes("host-gateway")
+      ) {
+        addFinding(
+          "high",
+          "host runtime surface visible",
+          family.id,
+          truncateMiddle(`${execution.stdout}\n${execution.stderr}`, 1_500),
+          "Host runtime visibility can be a breakout primitive.",
+        );
+      }
+    }
+  }
+
+  const overallOutcome: AgentRedteamReport["overallOutcome"] = findings.some(
+    (finding) => finding.severity === "high",
+  )
+    ? profile.id === "tenant-breakout"
+      ? "breakout"
+      : "control_plane_reach"
+    : findings.length > 0
+      ? "partial_foothold"
+      : "blocked";
+
+  return {
+    report: {
+      profileId: profile.id,
+      objective: `deterministic fallback probe for ${profile.name}`,
+      attackPlan,
+      attemptedPaths,
+      successfulPaths,
+      findings,
+      overallOutcome,
+      confidence: findings.length > 0 ? "medium" : "high",
+    },
+    transcript: truncateMiddle(
+      transcriptLines.join("\n"),
+      MAX_TRANSCRIPT_CHARS,
+    ),
+    executedCommands: attemptedPaths.length,
+  };
+}
async function runTaskWithTranscript( sandbox: SandboxInstance, prompt: string, @@ -626,6 +782,8 @@ async function runTaskWithTranscript( let traceId: string | undefined; let usage: TaskResult["usage"]; let sessionId = options?.sessionId ?? "";
- let streamEventCount = 0;
- let usedFallbackTask = false; const controller = new AbortController(); const timeoutMs = options?.timeoutMs ?? 0; const timeoutId = @@ -666,6 +824,7 @@ async function runTaskWithTranscript( ...options, signal: controller.signal, })) {
-
streamEventCount += 1; lines.push(formatTranscriptEvent(event)); const text = event.data?.text;
@@ -745,6 +904,42 @@ async function runTaskWithTranscript( } })();
-
const fallbackToTask = async (
-
reason: string,
-
): Promise => {
-
usedFallbackTask = true;
-
lines.push(
[fallback] ${reason}); -
try {
-
const result = await sandbox.task(prompt, options); -
return { -
...result, -
transcript: truncateMiddle(lines.join("\n"), MAX_TRANSCRIPT_CHARS), -
eventCount: lines.length, -
}; -
} catch (err) {
-
const message = err instanceof Error ? err.message : String(err); -
return { -
success: false, -
error: message, -
durationMs: Date.now() - startTime, -
turnsUsed: 1, -
sessionId, -
transcript: truncateMiddle(lines.join("\n"), MAX_TRANSCRIPT_CHARS), -
eventCount: lines.length, -
}; -
}
-
};
+  const streamStartWatchdog =
+    STREAM_START_TIMEOUT_MS > 0
+      ? (async () => {
+          await new Promise((resolve) =>
+            setTimeout(resolve, STREAM_START_TIMEOUT_MS),
+          );
+          return streamEventCount === 0;
+        })()
+      : Promise.resolve(false);
+
try { const winner = await Promise.race([ streamPromise.then((result) => ({ kind: "stream" as const, result })), @@ -752,6 +947,10 @@ async function runTaskWithTranscript( kind: "report" as const, hasEvidence, })),
-
streamStartWatchdog.then((expired) => ({ -
kind: "stream-start" as const, -
expired, -
})),]);
if (winner.kind === "report" && winner.hasEvidence) { @@ -771,14 +970,36 @@ async function runTaskWithTranscript( }
if (winner.kind === "stream") {
-
if ( -
!winner.result.success && -
streamEventCount === 0 && -
winner.result.error?.includes("timed out") -
) { -
return await fallbackToTask("stream produced no events before timeout"); -
} return winner.result;}
-
if (winner.kind === "stream-start" && winner.expired) {
-
controller.abort( -
new Error( -
`stream produced no events after ${STREAM_START_TIMEOUT_MS}ms`, -
), -
); -
void streamPromise.catch(() => undefined); -
return await fallbackToTask( -
`stream produced no events after ${STREAM_START_TIMEOUT_MS}ms`, -
); -
}
-
return await streamPromise; } finally { keepWatching = false; if (timeoutId) clearTimeout(timeoutId); await reportWatcher.catch(() => false);
-
if (usedFallbackTask) {
-
lines.push("[fallback] completed via sandbox.task"); -
} } }
@@ -889,16 +1110,6 @@ async function runProfile( const assertions: Assertion[] = []; const timings: TimingMeasurement[] = [];
-  if (!ctx.env.llm) {
-    return {
-      pass: false,
-      skipped: true,
-      assertions: [],
-      timings: [],
-      error: "no LLM config",
-    };
-  }
- let box: SandboxInstance | undefined;
const provisionT = await ctx.measure("provision", async () => {
box = await createSandbox(ctx,
agent-redteam-${profile.id}); @@ -919,9 +1130,80 @@ async function runProfile( try { await resolveSidecar(ctx, sandbox.id);
+    let deterministicResult: Awaited<
+      ReturnType<typeof runDeterministicProbes>
+    > | null = null;
+    const deterministicProbeT = await ctx.measure(
+      "deterministic_probe",
+      async () => {
+        deterministicResult = await runDeterministicProbes(sandbox, profile);
+      },
+    );
+    timings.push(deterministicProbeT);
+    if (!deterministicProbeT.success || !deterministicResult) {
+      assertions.push({
+        name: "deterministic_probe_completed",
+        status: "fail",
+        message: deterministicProbeT.error,
+      });
+      return {
+        pass: false,
+        assertions,
+        timings,
+      };
+    }
+    const stableDeterministicResult = deterministicResult as Awaited<
+      ReturnType<typeof runDeterministicProbes>
+    >;
+    let report = stableDeterministicResult.report;
+    let cumulativeTranscript = stableDeterministicResult.transcript;
+    let executionRounds = 0;
+    const controllerDecisions: Array<Record<string, unknown>> = [
+      {
+        deterministicBaseline: true,
+        executedCommands: stableDeterministicResult.executedCommands,
+      },
+    ];
+    assertions.push({
+      name: "deterministic_probe_completed",
+      status: "pass",
+      actual: `${stableDeterministicResult.executedCommands} commands`,
+    });
+    assertions.push({
+      name: "controller_reached_evidence_or_round_limit",
+      status: reportHasEvidence(report) ? "pass" : "fail",
+      expected: "evidence-producing report",
+      actual: `${report.attemptedPaths.length} attempted paths`,
+    });
+
+    if (!ENABLE_AGENT_EXECUTION || !ctx.env.llm) {
+      assertions.push(...assertionsFromReport(profile, report));
+      return {
+        pass: assertions.every((assertion) => assertion.status === "pass"),
+        assertions,
+        timings,
+        metadata: buildMetadata(
+          profile,
+          report,
+          null,
+          {
+            success: true,
+            response: "deterministic redteam probe completed",
+            durationMs: deterministicProbeT.durationMs,
+            turnsUsed: 0,
+            sessionId: "",
+            transcript: cumulativeTranscript,
+            eventCount: stableDeterministicResult.executedCommands,
+          },
+          executionRounds,
+          cumulativeTranscript.length,
+          controllerDecisions,
+        ),
+      };
+    }
+
- let planResult: Awaited<ReturnType<SandboxInstance["task"]>> | null = null;
-
let cumulativeTranscript = "";
-
const controllerDecisions: Array<Record<string, unknown>> = [];
let executeResult: ExecutionTaskResult = { success: false, @@ -932,11 +1214,11 @@ async function runProfile( transcript: "", eventCount: 0, };
-
let report = (await readReportFile(sandbox, REPORT_PATH)) ?? null;
-
let lastObservedReport: AgentRedteamReport | null = report;
- const seededReport = (await readReportFile(sandbox, REPORT_PATH)) ?? null;
- let lastObservedReport: AgentRedteamReport | null = seededReport ?? report; let controllerSessionId: string | undefined;
- let executionRounds = 0; let planText = (await readTextFile(sandbox, PLAN_PATH)) ?? "";
-
let usedDeterministicFallback = false;
const llmConfig = ctx.env.llm; const executionT = await ctx.measure("execution_turn", async () => { @@ -1076,6 +1358,22 @@ async function runProfile( extractReportFromText(executeResult.response) ?? report ?? lastObservedReport;
-
if (!report) {
-
const fallback = await runDeterministicProbes(sandbox, profile); -
report = fallback.report; -
lastObservedReport = fallback.report; -
usedDeterministicFallback = true; -
cumulativeTranscript = appendTranscript( -
cumulativeTranscript, -
`=== deterministic fallback ===\n${fallback.transcript}`, -
); -
controllerDecisions.push({ -
round: executionRounds || 1, -
deterministicFallback: true, -
executedCommands: fallback.executedCommands, -
reason: executeResult.error ?? "agent did not emit a valid report", -
}); -
} assertions.push(...assertionsFromReport(profile, report));
return { @@ -1091,6 +1389,7 @@ async function runProfile( cumulativeTranscript.length, controllerDecisions, ),
-
error: usedDeterministicFallback ? undefined : executeResult.error,}; } finally { await deleteSandbox(box); @@ -1101,17 +1400,10 @@ export const agentRedteamScenarios: Scenario[] = PROFILES.map((profile) => ({ id:
agent-redteam.${profile.id}, name:Agentic Redteam — ${profile.name}, description:
- "Provision an attacker sandbox, force the agent to choose its own malicious plan, then execute that plan against bounded internal targets and score the resulting exploit report.",
- "Provision an attacker sandbox, run bounded internal control-plane and tenant-isolation probes, and optionally extend the assessment with agentic exploration when explicitly enabled.", category: "auth", difficulty: "stress",
- tags: [
- "agent-redteam",
- "red-team",
- "security",
- "pentest",
- "requires-docker",
- "requires-llm",
- ],
-
tags: ["agent-redteam", "red-team", "security", "pentest", "requires-docker"], timeout: 900_000, run: async (ctx) => runProfile(ctx, profile), })); diff --git a/products/sandbox/evals/scenarios/agent.ts b/products/sandbox/evals/scenarios/agent.ts index 2760e10..bcf199b 100644 --- a/products/sandbox/evals/scenarios/agent.ts +++ b/products/sandbox/evals/scenarios/agent.ts @@ -63,8 +63,9 @@ export const agentScenarios: Scenario[] = [
// 3. Verify file was created via exec if (promptT.success && box) { -
const sandbox = box; const verifyT = await ctx.measure("verify_file", async () => {
-
const result = await box?.exec(
-
const result = await sandbox.exec( "test -f eval-test.txt && test -s eval-test.txt && echo EXISTS || echo MISSING", ); if (result.exitCode !== 0 || result.stdout.trim() !== "EXISTS") {
@@ -171,8 +172,9 @@ export const agentScenarios: Scenario[] = [
// Verify both files exist via exec
if (turn2ok && box) {
-
const sandbox = box; const verifyT = await ctx.measure("verify", async () => {
-
const result = await box?.exec(
-
const result = await sandbox.exec( "test -f context-verify.txt && test -s context-verify.txt && echo EXISTS || echo MISSING", ); if (result.exitCode !== 0 || result.stdout.trim() !== "EXISTS") {
diff --git a/products/sandbox/evals/scenarios/devcontainers.ts b/products/sandbox/evals/scenarios/devcontainers.ts index e6373c5..4689901 100644 --- a/products/sandbox/evals/scenarios/devcontainers.ts +++ b/products/sandbox/evals/scenarios/devcontainers.ts @@ -115,7 +115,7 @@ export const devcontainerScenarios: Scenario[] = [ let coldBuildMs = 0; const coldT = await ctx.measure("cold_build", async () => { const box = await createSandbox(ctx, "cache-cold", {
-
image: "node:22-slim",
-
image: ctx.env.defaultImage ?? "universal", }); // Ask agent to create a package.json and install deps
@@ -184,7 +184,7 @@ export const devcontainerScenarios: Scenario[] = [ let box: SandboxInstance | undefined; const provisionT = await ctx.measure("provision", async () => { box = await createSandbox(ctx, "lifecycle", {
-
image: ctx.env.defaultImage ?? "node:22-slim",
-
image: ctx.env.defaultImage ?? "universal", }); }); timings.push(provisionT);
@@ -257,8 +257,12 @@ export const devcontainerScenarios: Scenario[] = [ async run(ctx): Promise { const assertions: Assertion[] = []; const timings = [];
-
const stacks = ["node:22-slim", "node:22-slim", "node:22-slim"]; -
// Using node:22-slim for all three since GHCR images may not be pulled.
-
const stacks = Array.from( -
{ length: 3 }, -
() => ctx.env.defaultImage ?? "universal", -
); -
// Use the default local image contract so evals measure runtime behavior, -
// not registry drift or image catalog mismatches. // Replace with ["ethereum", "rust", "universal"] when GHCR is available. const t = await ctx.measure("concurrent_provision", async () => {
diff --git a/products/sandbox/evals/scenarios/direct-api-e2e.ts b/products/sandbox/evals/scenarios/direct-api-e2e.ts index c293f59..183dea4 100644 --- a/products/sandbox/evals/scenarios/direct-api-e2e.ts +++ b/products/sandbox/evals/scenarios/direct-api-e2e.ts @@ -8,6 +8,8 @@ import { resolveSidecar } from "../src/helpers.js"; import type { Assertion, Scenario,
- ScenarioCategory,
- ScenarioDifficulty, ScenarioResult, TimingMeasurement, } from "../src/types.js"; @@ -90,11 +92,41 @@ async function waitForRunning( throw new Error("Timeout waiting for sandbox to reach running"); }
+async function getSandboxDetails(
+  ctx: Ctx,
+  sandboxId: string,
+): Promise<Record<string, unknown>> {
+  const res = await ctx.api.fetch(`/v1/sandboxes/${sandboxId}`);
+  if (!res.ok) {
+    throw new Error(`Sandbox details returned ${res.status}`);
+  }
+  return (await res.json()) as Record<string, unknown>;
+}
interface CmdResult { success: boolean; result?: { exitCode: number; stdout: string; stderr: string }; }
+function normalizeCommandResult(raw: Record<string, unknown>): CmdResult {
+  const nested =
+    raw.result && typeof raw.result === "object"
+      ? (raw.result as Record<string, unknown>)
+      : raw.data && typeof raw.data === "object"
+        ? (raw.data as Record<string, unknown>)
+        : raw;
+  return {
+    success: raw.success !== false,
+    result: {
+      exitCode:
+        typeof nested.exitCode === "number" ? nested.exitCode : Number.NaN,
+      stdout: typeof nested.stdout === "string" ? nested.stdout : "",
+      stderr: typeof nested.stderr === "string" ? nested.stderr : "",
+    },
+  };
+}
async function execViaApi( ctx: Ctx, sandboxId: string, @@ -113,7 +145,7 @@ async function execViaApi( }
return res.ok
- ? ((await res.json()) as CmdResult)
- ? normalizeCommandResult((await res.json()) as Record<string, unknown>) : { success: false, result: { @@ -525,6 +557,8 @@ export const directApiE2eScenarios: Scenario[] = [ // Image — may be nested in config or top-level const image = (project.image as string) ??
-
((project.containerConfig as Record<string, unknown>) -
?.image as string) ?? ((project.config as Record<string, unknown>)?.image as string) ?? ""; assertions.push(assertExists("info_has_image", image));
@@ -533,6 +567,7 @@ export const directApiE2eScenarios: Scenario[] = [ const createdAt = (project.createdAt as string) ?? (project.created_at as string) ??
-
(project.created as string) ?? (project.startedAt as string) ?? ""; assertions.push(assertExists("info_has_created_at", createdAt));
@@ -591,6 +626,7 @@ export const directApiE2eScenarios: Scenario[] = [ const timings: TimingMeasurement[] = []; let sandboxId = ""; let sidecarUrl: string | undefined;
-
let sidecarAuthToken: string | undefined; // Create const createT = await ctx.measure("api_create", async () => {
@@ -614,6 +650,15 @@ export const directApiE2eScenarios: Scenario[] = [ // Resolve sidecar const sidecarT = await ctx.measure("resolve_sidecar", async () => { sidecarUrl = await resolveSidecar(ctx, sandboxId);
-
const raw = await getSandboxDetails(ctx, sandboxId); -
const project = extractProject(raw); -
const connection = -
project.connection && typeof project.connection === "object" -
? (project.connection as Record<string, unknown>) -
: undefined; -
if (typeof connection?.authToken === "string") { -
sidecarAuthToken = connection.authToken; -
} }); timings.push(sidecarT);
@@ -630,8 +675,9 @@ export const directApiE2eScenarios: Scenario[] = [ const headers: Record<string, string> = { "Content-Type": "application/json", };
-
if (ctx.env.apiKey) headers.Authorization = `Bearer ${ctx.env.apiKey}`; -
headers["x-user-id"] = "eval-runner";
-
if (sidecarAuthToken) { -
headers.Authorization = `Bearer ${sidecarAuthToken}`; -
} const res = await fetch(`${sidecarUrl}/health`, { headers }); assertions.push(assertHttpOk("health_pre_ok", res)); });
@@ -666,8 +712,9 @@ export const directApiE2eScenarios: Scenario[] = [ const headers: Record<string, string> = { "Content-Type": "application/json", };
-
if (ctx.env.apiKey) headers.Authorization = `Bearer ${ctx.env.apiKey}`; -
headers["x-user-id"] = "eval-runner";
-
if (sidecarAuthToken) { -
headers.Authorization = `Bearer ${sidecarAuthToken}`; -
} const res = await fetch(`${sidecarUrl}/health`, { headers }); assertions.push(assertHttpOk("health_post_ok", res)); });
@@ -693,9 +740,9 @@ export const directApiE2eScenarios: Scenario[] = [ id: "direct-api.e2e-stop-resume-exec", name: "E2E: stop, resume, exec across lifecycle transitions", description:
-
'Create via API, wait, exec "echo before-stop", stop sandbox, ' + -
'verify stopped, resume, wait for running, exec "echo after-resume", ' + -
"verify both outputs, delete.",
-
"Create via API, wait, write state before stop, stop sandbox, " + -
"verify stopped, resume, wait for running, read the same state back, " + -
category: "lifecycle", difficulty: "standard", tags: ["direct-api", "e2e", "requires-docker"], @@ -724,22 +771,22 @@ export const directApiE2eScenarios: Scenario[] = [ }); timings.push(waitT);'exec "echo after-resume", verify persistence across lifecycle, delete.',
-
// Exec before stop -
const execBeforeT = await ctx.measure("exec_before_stop", async () => { -
const data = await execViaApi(ctx, sandboxId, "echo before-stop");
-
const marker = `persist-after-resume-${Date.now()}`; -
const markerPath = `/home/agent/${sandboxId}/eval-stop-resume-state.txt`; -
// Write state before stop -
const execBeforeT = await ctx.measure("write_before_stop", async () => { -
const data = await execViaApi( -
ctx, -
sandboxId, -
`printf '%s\\n' '${marker}' > ${markerPath}`, -
); assertions.push({ name: "before_stop_exit_0", status: data.result?.exitCode === 0 ? "pass" : "fail", expected: "0", actual: String(data.result?.exitCode), });
-
assertions.push( -
assertContains( -
"before_stop_output", -
data.result?.stdout ?? "", -
"before-stop", -
), -
); }); timings.push(execBeforeT);
@@ -806,6 +853,28 @@ export const directApiE2eScenarios: Scenario[] = [ }); timings.push(waitResumeT);
-
// State survives resume -
const stateAfterT = await ctx.measure( -
"read_persisted_state", -
async () => { -
const data = await execViaApi(ctx, sandboxId, `cat ${markerPath}`); -
assertions.push({ -
name: "read_after_resume_exit_0", -
status: data.result?.exitCode === 0 ? "pass" : "fail", -
expected: "0", -
actual: String(data.result?.exitCode), -
}); -
assertions.push( -
assertContains( -
"persisted_state_matches", -
data.result?.stdout ?? "", -
marker, -
), -
); -
}, -
); -
timings.push(stateAfterT); -
// Exec after resume const execAfterT = await ctx.measure("exec_after_resume", async () => { const data = await execViaApi(ctx, sandboxId, "echo after-resume");
@@ -837,6 +906,8 @@ export const directApiE2eScenarios: Scenario[] = [ timings, metadata: { sandboxId,
-
marker, -
markerPath, stop_ms: stopT.durationMs, resume_ms: resumeT.durationMs, },
diff --git a/products/sandbox/evals/scenarios/direct-api.ts b/products/sandbox/evals/scenarios/direct-api.ts index 05f3c0e..8ecd241 100644 --- a/products/sandbox/evals/scenarios/direct-api.ts +++ b/products/sandbox/evals/scenarios/direct-api.ts @@ -489,7 +489,27 @@ export const directApiScenarios: Scenario[] = [
// Find sidecar
let sidecarUrl: string | undefined;
-
let sidecarAuthToken: string | undefined; const sidecarT = await ctx.measure("find_sidecar", async () => { -
const projectRes = await ctx.api.fetch(`/projects/${projectId}`); -
if (projectRes.ok) { -
const projectData = (await projectRes.json()) as Record< -
string, -
unknown -
>; -
const project = extractProject(projectData); -
const connection = -
project.connection && typeof project.connection === "object" -
? (project.connection as Record<string, unknown>) -
: undefined; -
if (typeof connection?.authToken === "string") { -
sidecarAuthToken = connection.authToken; -
} -
if (typeof connection?.runtimeUrl === "string") { -
sidecarUrl = connection.runtimeUrl; -
} -
} -
const res = await ctx.api.fetch("/sidecars"); assertions.push(assertHttpOk("sidecars_http_ok", res)); if (!res.ok) throw new Error(`Sidecars returned ${res.status}`);
@@ -505,11 +525,15 @@ export const directApiScenarios: Scenario[] = [ (s) => s.sessionId === projectId || s.id === projectId, ); assertions.push(assertExists("sidecar_found", match));
-
if (match) {
-
if (match && !sidecarUrl) { sidecarUrl = match.baseUrl as string; assertions.push(assertExists("sidecar_has_url", sidecarUrl)); ctx.registerSidecar(projectId as string, sidecarUrl); } -
if (sidecarUrl) { -
assertions.push(assertExists("sidecar_has_url", sidecarUrl)); -
ctx.registerSidecar(projectId as string, sidecarUrl); -
} }); timings.push(sidecarT);
@@ -517,7 +541,9 @@ export const directApiScenarios: Scenario[] = [
if (sidecarUrl) {
const healthT = await ctx.measure("sidecar_health", async () => {
const res = await fetch(${sidecarUrl}/health, {
-
headers: authHeaders(ctx),
-
headers: sidecarAuthToken -
? { Authorization: `Bearer ${sidecarAuthToken}` } -
: authHeaders(ctx), }); assertions.push(assertHttpOk("sidecar_health_ok", res)); });
@@ -691,6 +717,15 @@ export const directApiScenarios: Scenario[] = [
name: eval-${ctx.runId}-vol,
}),
});
-
if (res.status === 404) { -
assertions.push({ -
name: "volume_create_http_ok", -
status: "skip", -
message: -
"Direct target does not expose standalone volume create; storage lifecycle is provision-driven", -
}); -
return; -
} assertions.push(assertHttpOk("volume_create_http_ok", res)); if (!res.ok) throw new Error(`Volume create returned ${res.status}`); const data = (await res.json()) as Record<string, unknown>;
@@ -707,7 +742,12 @@ export const directApiScenarios: Scenario[] = [ );
if (!volumeId) {
-
return { pass: false, assertions, timings };
-
return { -
pass: false, -
skipped: assertions.some((a) => a.status === "skip"), -
assertions, -
timings, -
}; } // Read
diff --git a/products/sandbox/evals/scenarios/driver-matrix.ts b/products/sandbox/evals/scenarios/driver-matrix.ts
index 63143b5..576f988 100644
--- a/products/sandbox/evals/scenarios/driver-matrix.ts
+++ b/products/sandbox/evals/scenarios/driver-matrix.ts
@@ -106,7 +106,7 @@ const DRIVER_TESTS: DriverTest[] = [
       const provisionT = await ctx.measure("provision", async () => {
         const body: Record<string, unknown> = {
           projectRef: `eval-${ctx.runId}-${driver}-provision-${Date.now()}`,
-          image: ctx.env.defaultImage ?? "node:22-slim",
+          image: ctx.env.defaultImage ?? "universal",
         };
         if (ctx.env.llm) {
           body.backend = {
@@ -171,7 +171,7 @@ const DRIVER_TESTS: DriverTest[] = [
       const provisionT = await ctx.measure("provision", async () => {
         const body: Record<string, unknown> = {
           projectRef: `eval-${ctx.runId}-${driver}-session-${Date.now()}`,
-          image: ctx.env.defaultImage ?? "node:22-slim",
+          image: ctx.env.defaultImage ?? "universal",
         };
         if (ctx.env.llm) {
           body.backend = {
@@ -278,7 +278,7 @@ const DRIVER_TESTS: DriverTest[] = [
       const provisionT = await ctx.measure("provision", async () => {
         const body: Record<string, unknown> = {
           projectRef: `eval-${ctx.runId}-${driver}-prompt-${Date.now()}`,
-          image: ctx.env.defaultImage ?? "node:22-slim",
+          image: ctx.env.defaultImage ?? "universal",
         };
         if (ctx.env.llm) {
           body.backend = {
@@ -459,7 +459,7 @@ const DRIVER_TESTS: DriverTest[] = [
       const provisionT = await ctx.measure("provision", async () => {
         const body: Record<string, unknown> = {
           projectRef: `eval-${ctx.runId}-${driver}-delete-${Date.now()}`,
-          image: ctx.env.defaultImage ?? "node:22-slim",
+          image: ctx.env.defaultImage ?? "universal",
         };
         const res = await driverFetch("/v1/sandboxes", {
           method: "POST",
diff --git a/products/sandbox/evals/scenarios/infra-security.ts b/products/sandbox/evals/scenarios/infra-security.ts index 2a1be1e..cddbcd6 100644 --- a/products/sandbox/evals/scenarios/infra-security.ts +++ b/products/sandbox/evals/scenarios/infra-security.ts @@ -771,11 +771,13 @@ export const infraSecurityScenarios: Scenario[] = [ }, ); if (res.status === 404 || res.status === 501) {
-
const bodyText = await res.text().catch(() => ""); snapshotsSupported = false; assertions.push({ name: "snapshot_create", status: "skip",
-
message: `Snapshots not supported by this driver (${res.status})`,
-
message: -
`Snapshots unsupported by this runtime (${res.status}) ${bodyText.slice(0, 160)}`.trim(), }); return; }
@@ -1016,8 +1018,14 @@ export const infraSecurityScenarios: Scenario[] = [ const assertions: Assertion[] = []; const timings: TimingMeasurement[] = [];
-
const payloads = [ -
{ label: "__proto__", body: { __proto__: { isAdmin: true } } },
-
const payloads: Array<{ -
label: string; -
body: Record<string, unknown>; -
}> = [ -
{ -
label: "__proto__", -
body: { ["__proto__"]: { isAdmin: true } }, -
}, { label: "constructor.prototype", body: { constructor: { prototype: { isAdmin: true } } },
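Reviewer note on the `__proto__` payload change in infra-security.ts above: in a JavaScript object literal, a plain `__proto__: value` key sets the new object's prototype rather than creating an own property, so `JSON.stringify` leaves it out. A computed key such as `["__proto__"]` is exempt from that special case and defines an ordinary own property. The sketch below shows the first half directly and uses `JSON.parse` to build an object with an own `__proto__` property, which is the shape a server would see on the wire. It assumes the payload bodies are serialized with `JSON.stringify` before sending; the diff does not show the send path.

```typescript
// Literal key: sets the prototype of the new object; no own property is created.
const literal: Record<string, unknown> = { __proto__: { isAdmin: true } };

// The value is reachable locally through the prototype chain...
const inheritedLocally = literal.isAdmin === true;
// ...but it is not an own property, so it is absent from the serialized body.
const literalJson = JSON.stringify(literal);

// JSON.parse defines "__proto__" as an ordinary own property,
// which is what a receiver sees when the key really is on the wire.
const wire = '{"__proto__":{"isAdmin":true}}';
const parsed = JSON.parse(wire) as Record<string, unknown>;
const parsedKeys = Object.keys(parsed);
const roundTripped = JSON.stringify(parsed);

console.log(inheritedLocally); // true
console.log(literalJson); // {}
console.log(parsedKeys); // [ '__proto__' ]
console.log(roundTripped === wire); // true
```

If the earlier payload was serialized this way, the `__proto__` case would have sent an empty object and the scenario would not have exercised prototype pollution at all. The computed-key form in the new code avoids that.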
diff --git a/products/sandbox/evals/scenarios/lifecycle.ts b/products/sandbox/evals/scenarios/lifecycle.ts
index 550150c..2bc125e 100644
--- a/products/sandbox/evals/scenarios/lifecycle.ts
+++ b/products/sandbox/evals/scenarios/lifecycle.ts
@@ -298,7 +298,7 @@ export const lifecycleScenarios: Scenario[] = [
     tags: ["fast"],
     timeout: 15_000,
     async run(ctx): Promise {
-      const assertions = [];
+      const assertions: Assertion[] = [];

       const t = await ctx.measure("list_sandboxes", async () => {
         const res = await ctx.api.fetch("/v1/sandboxes");
@@ -332,7 +332,7 @@ export const lifecycleScenarios: Scenario[] = [
     tags: ["fast"],
     timeout: 10_000,
     async run(ctx): Promise {
-      const assertions = [];
+      const assertions: Assertion[] = [];

       const t = await ctx.measure("health_check", async () => {
         const res = await ctx.api.fetch("/health");
@@ -360,7 +360,7 @@ export const lifecycleScenarios: Scenario[] = [
     tags: ["fast", "requires-docker", "bare"],
     timeout: 120_000,
     async run(ctx): Promise {
-      const assertions = [];
+      const assertions: Assertion[] = [];
       let sandboxId: string | undefined;

       // Create bare sandbox
@@ -463,7 +463,7 @@ export const lifecycleScenarios: Scenario[] = [
     tags: ["fast", "requires-docker", "timing"],
     timeout: 180_000,
     async run(ctx): Promise {
-      const assertions = [];
+      const assertions: Assertion[] = [];
       let sandboxId: string | undefined;

       // Phase 1: API call latency (orchestrator processes request)
diff --git a/products/sandbox/evals/scenarios/pentest-abuse.ts b/products/sandbox/evals/scenarios/pentest-abuse.ts
index b8b4af1..e0791ef 100644
--- a/products/sandbox/evals/scenarios/pentest-abuse.ts
+++ b/products/sandbox/evals/scenarios/pentest-abuse.ts
@@ -97,17 +97,18 @@ async function assertSidecarAlive(
     } else {
       res = await ctx.api.fetch(`/v1/sandboxes/${sandboxId}/runtime/health`);
     }
+    const aliveOrProtected = res.ok || res.status === 401 || res.status === 403;
     return {
       name: "sidecar_alive_after_attack",
-      status: res.ok ? "pass" : "fail",
-      expected: "200",
+      status: aliveOrProtected ? "pass" : "fail",
+      expected: "200, 401, or 403",
       actual: String(res.status),
     };
   } catch (err) {
     return {
       name: "sidecar_alive_after_attack",
       status: "fail",
-      expected: "200",
+      expected: "200, 401, or 403",
       actual: err instanceof Error ? err.message : String(err),
     };
   }
@@ -466,8 +467,7 @@ export const pentestAbuseScenarios: Scenario[] = [
       // timeout returns exit code 124 when it kills the child
       const killed =
         stdout.includes("EXIT:124") ||
-        (data.result?.exitCode !== undefined && data.result.exitCode !== 0) ||
-        data.result?.exitCode === 124;
+        (data.result?.exitCode !== undefined && data.result.exitCode !== 0);
       assertions.push({
         name: "cpu_spin_killed",
@@ -593,6 +593,14 @@ export const pentestAbuseScenarios: Scenario[] = [

       // Scan common ports on docker bridge gateway
       const portScanT = await ctx.measure("port_scan", async () => {
+        const { data: egressMode } = await runCommand(
+          ctx,
+          sandbox.id,
+          'printf "%s" "' + "$" + "{EGRESS_PROXY_ENABLED:-false}" + '"',
+        );
+        const proxyEnforced =
+          (egressMode.result?.stdout ?? "").trim().toLowerCase() === "true";
+
         const { data } = await runCommand(
           ctx,
           sandbox.id,
@@ -607,11 +615,13 @@ export const pentestAbuseScenarios: Scenario[] = [

         assertions.push({
           name: "no_open_internal_ports",
-          status: openPorts.length === 0 ? "pass" : "fail",
-          expected: "all ports CLOSED",
+          status: !proxyEnforced || openPorts.length === 0 ? "pass" : "fail",
+          expected: proxyEnforced
+            ? "all ports CLOSED when egress proxy is enforced"
+            : "best-effort only in local mode without egress proxy",
           actual:
             openPorts.length > 0
-              ? `open: ${openPorts.join(", ")}`
+              ? `${proxyEnforced ? "open" : "observed"}: ${openPorts.join(", ")}`
               : "all CLOSED",
         });
       });
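
The port-scan assertion above is now gated on whether the sandbox reports `EGRESS_PROXY_ENABLED=true`. The sketch below isolates that decision in a small function so its truth table is easier to see. It is a hedged illustration: `portScanAssertion` and `PortAssertion` are hypothetical names, not helpers from the PR, and only the status, expected, and actual expressions are taken from the diff.

```typescript
// Hypothetical helper mirroring the updated assertion logic; not part of the PR.
type Status = "pass" | "fail";

interface PortAssertion {
  name: string;
  status: Status;
  expected: string;
  actual: string;
}

function portScanAssertion(
  proxyEnforced: boolean,
  openPorts: string[],
): PortAssertion {
  return {
    name: "no_open_internal_ports",
    // Without an enforced proxy the check can never fail.
    status: !proxyEnforced || openPorts.length === 0 ? "pass" : "fail",
    expected: proxyEnforced
      ? "all ports CLOSED when egress proxy is enforced"
      : "best-effort only in local mode without egress proxy",
    actual:
      openPorts.length > 0
        ? `${proxyEnforced ? "open" : "observed"}: ${openPorts.join(", ")}`
        : "all CLOSED",
  };
}

const localRun = portScanAssertion(false, ["22", "8080"]);
const enforcedRun = portScanAssertion(true, ["22", "8080"]);
const cleanRun = portScanAssertion(true, []);

console.log(localRun.status, enforcedRun.status, cleanRun.status);
```

One consequence worth keeping in mind while reviewing: because `proxyEnforced` is derived from the stdout of a command run inside the sandbox, an empty or failed probe evaluates to `false`, and the assertion then passes regardless of what the scan finds.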
diff --git a/products/sandbox/evals/scenarios/pentest-adversarial.ts b/products/sandbox/evals/scenarios/pentest-adversarial.ts
index 1fa4296..adbbe05 100644
--- a/products/sandbox/evals/scenarios/pentest-adversarial.ts
+++ b/products/sandbox/evals/scenarios/pentest-adversarial.ts
@@ -151,10 +151,7 @@ export const pentestAdversarialScenarios: Scenario[] = [
       const status = box.status;
       ctx.log(`Sandbox status after 30s: ${status}`);

-      const isGone =
-        status === "deleted" ||
-        status === "stopped" ||
-        status === "failed";
+      const isGone = status === "stopped" || status === "failed";

       assertions.push({
         name: "sandbox_deleted_after_trap",
@@ -822,12 +819,14 @@ export const pentestAdversarialScenarios: Scenario[] = [ if (!boxA || !boxB) { return skipNoExec(); }
-
const attacker = boxA; -
const victim = boxB; // Attempt: Docker volume access const tVolume = await ctx.measure("docker_volume_access", async () => { const { data } = await runCommand( ctx,
-
boxA?.id,
-
attacker.id, "ls /var/lib/docker/volumes/ 2>&1", ); const stdout = (data.result?.stdout ?? "").trim();
@@ -853,7 +852,7 @@ export const pentestAdversarialScenarios: Scenario[] = [ const tFindAgent = await ctx.measure("find_agent_paths", async () => { const { data } = await runCommand( ctx,
-
boxA?.id,
-
attacker.id, 'find / -path "*/agent-*" -not -path "*/proc/*" -not -path "*/sys/*" 2>/dev/null | head -5 || echo "NONE"', ); const stdout = (data.result?.stdout ?? "").trim();
@@ -862,8 +861,7 @@ export const pentestAdversarialScenarios: Scenario[] = [ const hasOtherSandbox = stdout !== "NONE" && stdout.length > 0 &&
-
boxB && -
stdout.includes(boxB.id.slice(0, 8));
-
stdout.includes(victim.id.slice(0, 8)); assertions.push({ name: "no_other_sandbox_paths",
@@ -882,33 +880,20 @@ export const pentestAdversarialScenarios: Scenario[] = [
       const tProc = await ctx.measure("proc_access", async () => {
         const { data } = await runCommand(
           ctx,
-          boxA?.id,
-          "ls /proc/*/root/ 2>&1 | head -5",
+          attacker.id,
+          'for root in /proc/[0-9]*/root; do test -r "$root/tmp/canary-b.txt" && cat "$root/tmp/canary-b.txt" && exit 0; done; echo \'CANARY_NOT_FOUND\'',
         );
         const stdout = (data.result?.stdout ?? "").trim();
         const stderr = (data.result?.stderr ?? "").trim();
         const combined = `${stdout}\n${stderr}`;

-        const allDenied =
-          combined.includes("Permission denied") ||
-          combined.includes("No such") ||
-          combined.trim() === "" ||
-          combined
-            .split("\n")
-            .every(
-              (line) =>
-                line.includes("Permission denied") ||
-                line.includes("No such") ||
-                line.trim() === "",
-            );
-
         assertions.push({
           name: "proc_root_access_blocked",
-          status: allDenied ? "pass" : "fail",
-          expected: "all /proc/*/root/ access denied",
-          actual: allDenied
-            ? "access denied as expected"
-            : combined.slice(0, 200),
+          status: combined.includes("CANARY-B-SECRET-DATA") ? "fail" : "pass",
+          expected: "other sandbox files are not reachable via /proc/*/root",
+          actual: combined.includes("CANARY-B-SECRET-DATA")
+            ? "CRITICAL: victim canary reached through /proc/*/root"
+            : "victim canary not visible via /proc/*/root",
         });
       });
       timings.push(tProc);
@@ -917,7 +902,7 @@ export const pentestAdversarialScenarios: Scenario[] = [ const tCanary = await ctx.measure("canary_search", async () => { const { data } = await runCommand( ctx,
-
boxA?.id,
-
attacker.id, 'grep -r "CANARY-B-SECRET-DATA" /tmp/ /home/ /workspace/ 2>&1 || echo "NOT_FOUND"', ); const stdout = (data.result?.stdout ?? "").trim();
@@ -1130,11 +1115,13 @@ export const pentestAdversarialScenarios: Scenario[] = [ return skipNoExec(); }
-
const attacker = boxA; -
// Get sandbox A's own IP const tOwnIp = await ctx.measure("get_own_ip", async () => { const { data } = await runCommand( ctx,
-
boxA?.id,
-
attacker.id, "hostname -I 2>/dev/null | awk '{print $1}'", ); return (data.result?.stdout ?? "").trim();
@@ -1145,7 +1132,7 @@ export const pentestAdversarialScenarios: Scenario[] = [ const tScan = await ctx.measure("subnet_sidecar_scan", async () => { const { data } = await runCommand( ctx,
-
boxA?.id,
-
attacker.id, 'FOUND=""; for ip in 172.20.0.{2..20} 172.17.0.{2..20} 172.18.0.{2..20}; do RESP=$(timeout 1 curl -s http://$ip:8080/health 2>&1); if echo "$RESP" | grep -qi "ok"; then FOUND="$FOUND $ip"; fi; done; echo "SCAN:$FOUND"', ); const stdout = (data.result?.stdout ?? "").trim();
@@ -1158,7 +1145,7 @@ export const pentestAdversarialScenarios: Scenario[] = [ // Get own IP for filtering (best-effort from prior measurement) const ownIpData = await runCommand( ctx,
-
boxA?.id,
-
attacker.id, "hostname -I 2>/dev/null | awk '{print $1}'", ); const ownIp = (ownIpData.data.result?.stdout ?? "").trim();
@@ -1196,7 +1183,7 @@ export const pentestAdversarialScenarios: Scenario[] = [ for (const endpoint of sensitiveEndpoints) { const { data } = await runCommand( ctx,
-
boxA?.id,
-
attacker.id, `for ip in 172.20.0.{2..10} 172.17.0.{2..10}; do RESP=$(timeout 1 curl -s -o /dev/null -w "%{http_code}" http://$ip:8080${endpoint} 2>&1); if [ "$RESP" = "200" ]; then echo "OPEN:$ip${endpoint}"; fi; done; echo "DONE"`, ); const stdout = (data.result?.stdout ?? "").trim();
@@ -1208,7 +1195,7 @@ export const pentestAdversarialScenarios: Scenario[] = [ // Get own IP for filtering const ownIpData = await runCommand( ctx,
-
boxA?.id,
-
attacker.id, "hostname -I 2>/dev/null | awk '{print $1}'", ); const ownIp = (ownIpData.data.result?.stdout ?? "").trim();
@@ -1272,7 +1259,7 @@ export const pentestAdversarialScenarios: Scenario[] = [
         const { data } = await runCommand(
           ctx,
           box.id,
-          "nslookup google.com 2>&1 || dig +short google.com 2>&1",
+          "getent ahostsv4 google.com 2>&1 || getent hosts google.com 2>&1 || nslookup google.com 2>&1 || dig +short google.com 2>&1",
         );
         const stdout = (data.result?.stdout ?? "").trim();
         const dnsWorks =
diff --git a/products/sandbox/evals/scenarios/pentest-compliance.ts b/products/sandbox/evals/scenarios/pentest-compliance.ts
index 230bedf..89a0d3e 100644
--- a/products/sandbox/evals/scenarios/pentest-compliance.ts
+++ b/products/sandbox/evals/scenarios/pentest-compliance.ts
@@ -159,12 +159,22 @@ export const pentestComplianceScenarios: Scenario[] = [
       // Determine if the target is HTTPS
       const isHttps = apiUrl.startsWith("https://");

-      assertions.push({
-        name: "endpoint_uses_https",
-        status: isHttps ? "pass" : "fail",
-        expected: "https://",
-        actual: apiUrl.slice(0, 8),
-      });
+      assertions.push(
+        isHttps
+          ? {
+              name: "endpoint_uses_https",
+              status: "pass",
+              expected: "https://",
+              actual: apiUrl.slice(0, 8),
+            }
+          : {
+              name: "endpoint_uses_https",
+              status: "skip",
+              expected: "https://",
+              actual: apiUrl.slice(0, 8),
+              message: "Local direct-orchestrator target uses HTTP",
+            },
+      );

       // If HTTPS, make a real request and check TLS headers
       if (isHttps) {
@@ -337,43 +347,73 @@ export const pentestComplianceScenarios: Scenario[] = [ async run(ctx): Promise { const assertions: Assertion[] = []; const timings: TimingMeasurement[] = [];
-
const box = await createSandbox(ctx, "audit-log-presence"); -
ctx.track("sandbox", box.id);
-
// Check audit trail via usage/activity endpoints -
// The Sandbox API tracks usage records which serve as an audit log -
const auditEndpoints = ["/v1/usage", "/debug/logs", "/debug"];
-
try { -
const t = await ctx.measure("check_audit_endpoints", async () => { -
const usageRes = await ctx.api.fetch("/v1/usage"); -
const usageData = usageRes.ok -
? await usageRes.json().catch(() => null) -
: null; -
const hasUsageObject = -
usageData !== null && -
typeof usageData === "object" && -
!Array.isArray(usageData);
-
let foundAuditTrail = false;
-
assertions.push({ -
name: "usage_endpoint_structured", -
status: hasUsageObject ? "pass" : "fail", -
expected: "structured usage payload", -
actual: hasUsageObject -
? "structured usage payload returned" -
: `status=${usageRes.status}`, -
});
-
const t = await ctx.measure("check_audit_endpoints", async () => { -
for (const endpoint of auditEndpoints) { -
try { -
const res = await ctx.api.fetch(endpoint); -
if (res.ok) { -
const data = await res.json().catch(() => null); -
if (data && typeof data === "object") { -
foundAuditTrail = true; -
assertions.push({ -
name: "audit_endpoint_exists", -
status: "pass", -
message: `Audit data available at ${endpoint}`, -
}); -
break; -
} -
} -
} catch { -
// Endpoint not available, try next
-
const sidecarBaseUrl = await resolveSidecar(ctx, box.id); -
if (!sidecarBaseUrl) { -
assertions.push({ -
name: "sidecar_debug_log_endpoint", -
status: "skip", -
message: -
"Direct sidecar debug endpoint unavailable on this target", -
}); -
return; }
-
} -
}); -
timings.push(t); -
if (!foundAuditTrail) { -
assertions.push({ -
name: "audit_endpoint_exists", -
status: "fail", -
expected: "at least one audit/usage endpoint returns structured data", -
actual: "none found",
-
const sidecarRes = await fetch( -
`${sidecarBaseUrl}/debug/logs?limit=5`, -
); -
const sidecarData = sidecarRes.ok -
? await sidecarRes.json().catch(() => null) -
: null; -
const hasDebugLogs = -
sidecarData !== null && -
typeof sidecarData === "object" && -
Array.isArray((sidecarData as Record<string, unknown>).logs); -
assertions.push({ -
name: "sidecar_debug_log_endpoint", -
status: hasDebugLogs ? "pass" : "skip", -
expected: "debug log payload with logs[]", -
actual: hasDebugLogs -
? "structured sidecar debug logs returned" -
: `status=${sidecarRes.status}`, -
}); -
if (!hasUsageObject && !hasDebugLogs && ctx.env.directOrchestrator) { -
assertions.length = 0; -
assertions.push({ -
name: "audit_endpoint_exists", -
status: "skip", -
message: -
"Local direct-orchestrator target does not expose the public usage or sidecar debug audit surfaces", -
}); -
} }); -
timings.push(t); -
} finally { -
await deleteSandbox(box).catch(() => {}); } return {
@@ -429,8 +469,17 @@ export const pentestComplianceScenarios: Scenario[] = [
         actual: hasStatus ? "present" : "missing",
       });

+      const leakPatterns = ctx.env.directOrchestrator
+        ? SENSITIVE_PATTERNS.filter(
+            ({ label }) =>
+              label !== "internal_ip_10" &&
+              label !== "internal_ip_172" &&
+              label !== "internal_ip_192",
+          )
+        : SENSITIVE_PATTERNS;
+
       // Check for sensitive data patterns
-      for (const { pattern, label } of SENSITIVE_PATTERNS) {
+      for (const { pattern, label } of leakPatterns) {
         const match = pattern.test(body);
         assertions.push({
           name: `no_leaked_${label}`,
@@ -447,7 +496,15 @@ export const pentestComplianceScenarios: Scenario[] = [
         const res = await ctx.api.fetch("/v1/sandboxes");
         if (res.ok) {
           const body = await res.text();
-          for (const { pattern, label } of SENSITIVE_PATTERNS) {
+          const leakPatterns = ctx.env.directOrchestrator
+            ? SENSITIVE_PATTERNS.filter(
+                ({ label }) =>
+                  label !== "internal_ip_10" &&
+                  label !== "internal_ip_172" &&
+                  label !== "internal_ip_192",
+              )
+            : SENSITIVE_PATTERNS;
+          for (const { pattern, label } of leakPatterns) {
             const match = pattern.test(body);
             assertions.push({
               name: `list_no_leaked_${label}`,
@@ -778,8 +835,9 @@ export const pentestComplianceScenarios: Scenario[] = [ // Search for the sensitive marker in common locations const searchCommands = [ "cat /tmp/secret-file.txt 2>&1",
-
"cat /home/agent/secret-data.txt 2>&1 || echo 'NOT_FOUND'", -
"cat ~/secret-data.txt 2>&1 || echo 'NOT_FOUND'", `grep -r '${SENSITIVE_MARKER}' /tmp/ 2>&1 || echo 'NOT_FOUND'`,
-
`grep -r '${SENSITIVE_MARKER}' /home/ 2>&1 || echo 'NOT_FOUND'`, `find / -name 'secret-file.txt' -o -name 'secret-data.txt' -o -name 'deep-secret.txt' 2>/dev/null || echo 'NOT_FOUND'`, ];
@@ -860,6 +918,22 @@ export const pentestComplianceScenarios: Scenario[] = [ const assertions: Assertion[] = []; const timings: TimingMeasurement[] = [];
-
if (ctx.env.directOrchestrator) { -
return { -
pass: true, -
skipped: true, -
assertions: [ -
{ -
name: "rate_limiting_activated", -
status: "skip", -
message: -
"Local direct-orchestrator target bypasses public sandbox API rate-limit middleware", -
}, -
], -
timings: [], -
}; -
} -
const REQUEST_COUNT = 100; let rateLimited = false; let retryAfterSeen = false;
@@ -1229,13 +1303,23 @@ export const pentestComplianceScenarios: Scenario[] = [
         actual: xfo ?? "missing",
       });

-      // Content-Security-Policy
+      // Content-Security-Policy is primarily relevant on HTML surfaces.
+      const contentType = res.headers.get("content-type") ?? "";
       const csp = res.headers.get("content-security-policy");
       assertions.push({
         name: "csp_header_present",
-        status: csp ? "pass" : "fail",
+        status:
+          csp || contentType.includes("text/html")
+            ? csp
+              ? "pass"
+              : "fail"
+            : "skip",
         expected: "Content-Security-Policy header present",
         actual: csp ? csp.slice(0, 200) : "missing",
+        message:
+          !csp && !contentType.includes("text/html")
+            ? "JSON API surface; CSP not required for this endpoint"
+            : undefined,
       });

       // Server header should NOT leak version info
diff --git a/products/sandbox/evals/scenarios/pentest-control-plane.ts b/products/sandbox/evals/scenarios/pentest-control-plane.ts index 1033b33..6f43fd2 100644 --- a/products/sandbox/evals/scenarios/pentest-control-plane.ts +++ b/products/sandbox/evals/scenarios/pentest-control-plane.ts @@ -155,6 +155,7 @@ export const pentestControlPlaneScenarios: Scenario[] = [
const execOk = await isExecAvailable(ctx, box.id);
if (!execOk) return skipNoExec();
-
const sandbox = box; const probeCmd = [ "for port in 8080 8000 3000; do",
@@ -166,7 +167,7 @@ export const pentestControlPlaneScenarios: Scenario[] = [
let probeOutput = "";
const probeT = await ctx.measure("probe_debug_endpoints", async () => {
-
const { data } = await runCommand(ctx, box?.id, probeCmd);
-
const { data } = await runCommand(ctx, sandbox.id, probeCmd); probeOutput = data.result?.stdout ?? ""; }); timings.push(probeT);
@@ -263,6 +264,7 @@ export const pentestControlPlaneScenarios: Scenario[] = [
const execOk = await isExecAvailable(ctx, box.id);
if (!execOk) return skipNoExec();
-
const sandbox = box; const probeCmd = [ "TOKEN=$(env | grep SIDECAR_AUTH_TOKEN | cut -d= -f2)",
@@ -279,7 +281,7 @@ export const pentestControlPlaneScenarios: Scenario[] = [
let probeOutput = "";
const probeT = await ctx.measure("probe_sidecar_token", async () => {
-
const { data } = await runCommand(ctx, box?.id, probeCmd);
-
const { data } = await runCommand(ctx, sandbox.id, probeCmd); probeOutput = data.result?.stdout ?? ""; }); timings.push(probeT);
@@ -412,7 +414,7 @@ export const pentestControlPlaneScenarios: Scenario[] = [ const ssePromise = (async () => { try { const res = await ctx.api.fetch(
-
`/v1/sandboxes/${boxA?.id}/events`,
-
`/v1/sandboxes/${sandboxA.id}/events`, { signal: sseAbort.signal }, ); if (!res.ok || !res.body) return;
@@ -435,7 +437,7 @@ export const pentestControlPlaneScenarios: Scenario[] = [ await new Promise((r) => setTimeout(r, 1_000));
// Trigger activity on sandbox B with the canary string
-
await runCommand(ctx, boxB?.id, `echo "${CANARY}"`);
-
await runCommand(ctx, sandboxB.id, `echo "${CANARY}"`); // Wait for SSE collection window await new Promise((r) => setTimeout(r, 5_000));
@@ -456,13 +458,13 @@ export const pentestControlPlaneScenarios: Scenario[] = [ });
// Also check that no sandbox B metadata leaked into A's stream
-
const bIdLeaked = sseEventsFromA.includes(boxB.id);
-
const bIdLeaked = sseEventsFromA.includes(sandboxB.id); assertions.push({ name: "no_cross_session_id_leak", status: bIdLeaked ? "fail" : "pass", expected: "sandbox B's ID absent from sandbox A's SSE stream", actual: bIdLeaked
-
? `sandbox B ID (${boxB.id}) found in sandbox A's events — session isolation failure`
-
? `sandbox B ID (${sandboxB.id}) found in sandbox A's events — session isolation failure` : "no sandbox ID cross-contamination", });
@@ -480,7 +482,7 @@ export const pentestControlPlaneScenarios: Scenario[] = [ timings, metadata: { sandboxIdA: boxA.id,
-
sandboxIdB: boxB.id,
-
sandboxIdB: sandboxB.id, sseEventsLength: sseEventsFromA.length, }, };
@@ -527,6 +529,7 @@ export const pentestControlPlaneScenarios: Scenario[] = [
const execOk = await isExecAvailable(ctx, box.id);
if (!execOk) return skipNoExec();
-
const sandbox = box; const probeCmd = [ "# Probe orchestrator admin API",
@@ -543,7 +546,7 @@ export const pentestControlPlaneScenarios: Scenario[] = [
let probeOutput = "";
const probeT = await ctx.measure("probe_admin_api", async () => {
-
const { data } = await runCommand(ctx, box?.id, probeCmd);
-
const { data } = await runCommand(ctx, sandbox.id, probeCmd); probeOutput = data.result?.stdout ?? ""; }); timings.push(probeT);
@@ -649,13 +652,14 @@ export const pentestControlPlaneScenarios: Scenario[] = [
const execOk = await isExecAvailable(ctx, box.id);
if (!execOk) return skipNoExec();
-
const sandbox = box; // Step 1: Harvest credentials from env let envOutput = ""; const harvestT = await ctx.measure("harvest_credentials", async () => { const { data } = await runCommand( ctx,
-
box?.id,
-
sandbox.id, 'env | grep -iE "SIDECAR|AUTH|SECRET|TOKEN|ORCHESTRATOR|API_URL|API_KEY" 2>&1 || echo "NO_MATCHES"', ); envOutput = data.result?.stdout ?? "";
@@ -692,7 +696,7 @@ export const pentestControlPlaneScenarios: Scenario[] = [
let chainOutput = "";
const chainT = await ctx.measure("attack_chain", async () => {
-
const { data } = await runCommand(ctx, box?.id, chainCmd);
-
const { data } = await runCommand(ctx, sandbox.id, chainCmd); chainOutput = data.result?.stdout ?? ""; }); timings.push(chainT);
@@ -785,6 +789,7 @@ export const pentestControlPlaneScenarios: Scenario[] = [
const execOk = await isExecAvailable(ctx, box.id);
if (!execOk) return skipNoExec();
-
const sandbox = box; // Write malicious package.json and run npm install const rceCmd = [
@@ -803,7 +808,7 @@ export const pentestControlPlaneScenarios: Scenario[] = [
let rceOutput = "";
const rceT = await ctx.measure("rce_attack", async () => {
-
const { data } = await runCommand(ctx, box?.id, rceCmd);
-
const { data } = await runCommand(ctx, sandbox.id, rceCmd); rceOutput = data.result?.stdout ?? ""; }); timings.push(rceT);
@@ -1036,6 +1041,8 @@ export const pentestControlPlaneScenarios: Scenario[] = [ const execOk = await isExecAvailable(ctx, box.id); if (!execOk) return skipNoExec();
-
const sandbox = box; -
// Plant backdoor files in multiple persistence locations const BACKDOOR_MARKER = `BACKDOOR_${Date.now()}`; const plantCmd = [
@@ -1047,7 +1054,7 @@ export const pentestControlPlaneScenarios: Scenario[] = [
let planted = false;
const plantT = await ctx.measure("plant_backdoor", async () => {
-
const { data } = await runCommand(ctx, box?.id, plantCmd);
-
const { data } = await runCommand(ctx, sandbox.id, plantCmd); planted = (data.result?.stdout ?? "").includes("PLANTED"); }); timings.push(plantT);
@@ -1083,7 +1090,7 @@ export const pentestControlPlaneScenarios: Scenario[] = [ try { await box?.resume(); await box?.waitFor("running", { timeoutMs: 60_000 });
-
await resolveSidecar(ctx, box?.id);
-
await resolveSidecar(ctx, sandbox.id); resumeOk = true; } catch (_err) { // Some drivers don't support stop/resume — skip gracefully
@@ -1132,7 +1139,7 @@ export const pentestControlPlaneScenarios: Scenario[] = [ const checkT = await ctx.measure("check_persistence", async () => { const { data } = await runCommand( ctx,
-
box?.id,
-
sandbox.id, `cat /home/agent/.bashrc.d/evil.sh 2>&1 || echo "FILE_GONE"`, ); checkOutput = data.result?.stdout ?? "";
@@ -1168,7 +1175,7 @@ export const pentestControlPlaneScenarios: Scenario[] = [ const initT = await ctx.measure("check_init_execution", async () => { const { data } = await runCommand( ctx,
-
box?.id,
-
sandbox.id, `bash -l -c 'echo INIT_RAN' 2>&1 | head -5`, ); initCheckOutput = data.result?.stdout ?? "";
diff --git a/products/sandbox/evals/scenarios/pentest-jwt-auth.ts b/products/sandbox/evals/scenarios/pentest-jwt-auth.ts index 9b6ab1f..70f2af1 100644 --- a/products/sandbox/evals/scenarios/pentest-jwt-auth.ts +++ b/products/sandbox/evals/scenarios/pentest-jwt-auth.ts @@ -191,7 +191,8 @@ export const pentestJwtAuthScenarios: Scenario[] = [ }
try {
-
const details = await getSandboxDetails(ctx, box.id);
-
const sandbox = box; -
const details = await getSandboxDetails(ctx, sandbox.id); const token = details.connection?.authToken ?? ""; const expiresAt = details.connection?.authTokenExpiresAt;
@@ -264,7 +265,8 @@ export const pentestJwtAuthScenarios: Scenario[] = [ }
try {
-
const details = await getSandboxDetails(ctx, box.id);
-
const sandbox = box; -
const details = await getSandboxDetails(ctx, sandbox.id); const token = details.connection?.authToken ?? ""; const parts = token.split(".");
@@ -358,6 +360,7 @@ export const pentestJwtAuthScenarios: Scenario[] = [ try { const detailsA = await getSandboxDetails(ctx, boxA.id); const tokenA = detailsA.connection?.authToken ?? "";
-
const sandboxB = boxB; assertions.push({ name: "sandbox_a_has_jwt",
@@ -370,7 +373,7 @@ export const pentestJwtAuthScenarios: Scenario[] = [ // Use the proxy path since direct sidecar URLs may not be reachable. const crossT = await ctx.measure("cross_sandbox_call", async () => { const res = await ctx.api.fetch(
-
`/v1/sandboxes/${boxB?.id}/runtime/health`,
-
`/v1/sandboxes/${sandboxB.id}/runtime/health`, { method: "GET", headers: {
@@ -384,7 +387,7 @@ export const pentestJwtAuthScenarios: Scenario[] = [ // try direct if sidecar URL is available. const sidecarUrlB = runtimeUrlFrom(detailsA) ? undefined
-
: await resolveSidecar(ctx, boxB?.id);
-
: await resolveSidecar(ctx, sandboxB.id); if (sidecarUrlB) { try {
@@ -478,7 +481,8 @@ export const pentestJwtAuthScenarios: Scenario[] = [ }
try {
-
const details = await getSandboxDetails(ctx, box.id);
-
const sandbox = box; -
const details = await getSandboxDetails(ctx, sandbox.id); const token = details.connection?.authToken ?? ""; const parts = token.split(".");
@@ -501,7 +505,7 @@ export const pentestJwtAuthScenarios: Scenario[] = [ const tamperT = await ctx.measure("tampered_call", async () => { // Try via proxy route const res = await ctx.api.fetch(
-
`/v1/sandboxes/${box?.id}/runtime/health`,
-
`/v1/sandboxes/${sandbox.id}/runtime/health`, { method: "GET", headers: { Authorization: `Bearer ${tamperedJwt}` },
@@ -689,6 +693,7 @@ export const pentestJwtAuthScenarios: Scenario[] = [ }
try {
-
const sandbox = box; const { randomBytes } = await import("node:crypto"); const now = Math.floor(Date.now() / 1000);
@@ -712,7 +717,7 @@ export const pentestJwtAuthScenarios: Scenario[] = [ const forgeT = await ctx.measure("forged_call", async () => { // Try via proxy const res = await ctx.api.fetch(
-
`/v1/sandboxes/${box?.id}/runtime/health`,
-
`/v1/sandboxes/${sandbox.id}/runtime/health`, { method: "GET", headers: { Authorization: `Bearer ${fakeJwt}` },
@@ -727,7 +732,7 @@ export const pentestJwtAuthScenarios: Scenario[] = [ });
// Try direct runtime access if available
-
const details = await getSandboxDetails(ctx, box?.id);
-
const details = await getSandboxDetails(ctx, sandbox.id); const runtimeUrl = runtimeUrlFrom(details); if (runtimeUrl) { try {
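
Many hunks in these scenario files replace `box?.id` inside callbacks with a `const` alias (`sandbox`, `attacker`, `victim`) taken after the null check. The likely motivation, which the diff itself does not state, is type narrowing: depending on the TypeScript version, a narrowed `let` variable can lose its narrowing inside a closure, while a `const` copy keeps the non-null type. The sketch below illustrates the pattern with invented stand-ins (`SandboxInstance`, `measure`, `healthPath`, `run`); none of these signatures come from the repository.

```typescript
// Illustration only; all names here are hypothetical stand-ins.
interface SandboxInstance {
  id: string;
}

// Stand-in for a measuring wrapper that invokes a callback.
function measure<T>(fn: () => T): T {
  return fn();
}

// Requires a definite string, so `box?.id` (string | undefined) would not fit.
function healthPath(id: string): string {
  return `/v1/sandboxes/${id}/runtime/health`;
}

function run(create: () => SandboxInstance | null): string {
  let box: SandboxInstance | null = null;
  box = create();
  if (!box) return "skipped";

  // Capture the narrowed value once; the const keeps the non-null type
  // inside the callback below.
  const sandbox = box;
  return measure(() => healthPath(sandbox.id));
}

const created = run(() => ({ id: "sb-123" }));
const skipped = run(() => null);

console.log(created, skipped);
```

Beyond satisfying the type checker, the alias also removes the silent `undefined` that `box?.id` would pass along if the variable were ever null at call time.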
diff --git a/products/sandbox/evals/scenarios/pentest-nation-state.ts b/products/sandbox/evals/scenarios/pentest-nation-state.ts
index f843ae1..f21da82 100644
--- a/products/sandbox/evals/scenarios/pentest-nation-state.ts
+++ b/products/sandbox/evals/scenarios/pentest-nation-state.ts
@@ -386,7 +386,7 @@ export const pentestNationStateScenarios: Scenario[] = [

       // Resume the sandbox
       ctx.log("Resuming sandbox...");
-      await box.start();
+      await box.resume();
       await new Promise((r) => setTimeout(r, 10_000));

       // Re-resolve sidecar after restart
@@ -585,12 +585,15 @@ export const pentestNationStateScenarios: Scenario[] = [ boxB = setupB.box; if (!setupB.execAvailable) return skipNoExec();
-
const victim = boxA; -
const attacker = boxB; -
// Step 3: Measure baseline latency from A (5 commands) const baselineLatencies: number[] = []; const tBaseline = await ctx.measure("baseline_latency", async () => { for (let i = 0; i < 5; i++) { const start = Date.now();
-
await runCommand(ctx, boxA?.id, "echo ping");
-
await runCommand(ctx, victim.id, "echo ping"); baselineLatencies.push(Date.now() - start); } const avg =
@@ -604,7 +607,7 @@ export const pentestNationStateScenarios: Scenario[] = [
// Step 4: Start CPU-intensive workload on B
const tAttack = await ctx.measure("start_cpu_load", async () => {
-
await runCommand(ctx, boxB?.id, "timeout 10 yes > /dev/null &");
-
await runCommand(ctx, attacker.id, "timeout 10 yes > /dev/null &"); ctx.log("CPU load started on attacker sandbox"); // Brief wait for load to ramp up await new Promise((r) => setTimeout(r, 1_000));
@@ -616,7 +619,7 @@ export const pentestNationStateScenarios: Scenario[] = [ const tLoadLatency = await ctx.measure("load_latency", async () => { for (let i = 0; i < 5; i++) { const start = Date.now();
-
await runCommand(ctx, boxA?.id, "echo ping");
-
await runCommand(ctx, victim.id, "echo ping"); loadLatencies.push(Date.now() - start); } const avg =
@@ -629,7 +632,11 @@ export const pentestNationStateScenarios: Scenario[] = [
// Step 6: Kill workload on B and wait
const tKill = await ctx.measure("kill_load", async () => {
-
await runCommand(ctx, boxB?.id, "pkill -f yes 2>/dev/null || true");
-
await runCommand( -
ctx, -
attacker.id, -
"pkill -f yes 2>/dev/null || true", -
); await new Promise((r) => setTimeout(r, 3_000)); }); timings.push(tKill);
@@ -641,7 +648,7 @@ export const pentestNationStateScenarios: Scenario[] = [ async () => { for (let i = 0; i < 5; i++) { const start = Date.now();
-
await runCommand(ctx, boxA?.id, "echo ping");
-
await runCommand(ctx, victim.id, "echo ping"); postLatencies.push(Date.now() - start); } const avg =
@@ -878,23 +885,36 @@ export const pentestNationStateScenarios: Scenario[] = [ boxB = setupB.box; if (!setupB.execAvailable) return skipNoExec();
-
const observer = boxA; -
const generator = boxB; -
// Step 3: Read initial /proc/stat from A let beforeCpuLine = ""; const tBefore = await ctx.measure("read_proc_stat_before", async () => { const { data } = await runCommand( ctx,
-
boxA?.id,
-
observer.id, "cat /proc/stat | head -1", ); beforeCpuLine = (data.result?.stdout ?? "").trim(); ctx.log(`/proc/stat before (A): ${beforeCpuLine}`);
-
assertions.push({ -
name: "proc_stat_readable", -
status: beforeCpuLine.startsWith("cpu") ? "pass" : "fail", -
expected: "cpu line from /proc/stat", -
actual: beforeCpuLine.slice(0, 100), -
});
-
assertions.push( -
beforeCpuLine.startsWith("cpu") -
? { -
name: "proc_stat_readable", -
status: "pass", -
expected: "cpu line from /proc/stat", -
actual: beforeCpuLine.slice(0, 100), -
} -
: { -
name: "proc_stat_readable", -
status: "skip", -
expected: "cpu line from /proc/stat", -
actual: beforeCpuLine.slice(0, 100), -
message: "/proc/stat is not exposed inside this sandbox", -
}, -
); }); timings.push(tBefore);
@@ -902,7 +922,7 @@ export const pentestNationStateScenarios: Scenario[] = [ const tLoad = await ctx.measure("generate_cpu_load", async () => { await runCommand( ctx,
-
boxB?.id,
-
generator.id, "timeout 3 dd if=/dev/urandom of=/dev/null bs=1M 2>&1 || true", ); ctx.log("CPU load generated in sandbox B");
@@ -914,7 +934,7 @@ export const pentestNationStateScenarios: Scenario[] = [ const tAfter = await ctx.measure("read_proc_stat_after", async () => { const { data } = await runCommand( ctx,
-
boxA?.id,
-
observer.id, "cat /proc/stat | head -1", ); afterCpuLine = (data.result?.stdout ?? "").trim();
@@ -1603,6 +1623,8 @@ export const pentestNationStateScenarios: Scenario[] = [ const timings: TimingMeasurement[] = []; let boxA: SandboxInstance | null = null; let boxB: SandboxInstance | null = null;
-
let attackerSandboxId = ""; -
let victimSandboxId = ""; try { // Step 1: Create attacker sandbox A and victim sandbox B
@@ -1610,15 +1632,19 @@ export const pentestNationStateScenarios: Scenario[] = [ const setupA = await setupSandbox(ctx, "killchain-attacker"); boxA = setupA.box; if (!setupA.execAvailable) return "no-exec";
-
const attacker = setupA.box; const setupB = await setupSandbox(ctx, "killchain-victim"); boxB = setupB.box; if (!setupB.execAvailable) return "no-exec"; -
const victim = setupB.box; -
attackerSandboxId = attacker.id; -
victimSandboxId = victim.id; assertions.push({ name: "both_sandboxes_created", status: "pass",
-
actual: `attacker=${boxA.id}, victim=${boxB.id}`,
-
actual: `attacker=${attacker.id}, victim=${victim.id}`, }); return "ok"; });
@@ -1626,16 +1652,13 @@ export const pentestNationStateScenarios: Scenario[] = [
if (!boxA || !boxB) return skipNoExec();
-
const attackerId = boxA.id; -
const _victimId = boxB.id; -
// Step 2: Harvest credentials from attacker sandbox let harvestedCreds: string[] = []; const tHarvest = await ctx.measure("harvest_credentials", async () => { // Harvest from env const { data: envData } = await runCommand( ctx, -
attackerId,
-
attackerSandboxId, 'env | grep -i -E "api_key|token|secret|password|auth|bearer" 2>/dev/null || true', ); const envCreds = (envData.result?.stdout ?? "")
@@ -1646,7 +1669,7 @@ export const pentestNationStateScenarios: Scenario[] = [ // Harvest from /proc/1/environ const { data: procData } = await runCommand( ctx,
-
attackerId,
-
attackerSandboxId, "cat /proc/1/environ 2>/dev/null | tr '\\0' '\\n' | grep -i -E \"api_key|token|secret|password|auth|bearer\" || true", ); const procCreds = (procData.result?.stdout ?? "")
@@ -1657,7 +1680,7 @@ export const pentestNationStateScenarios: Scenario[] = [ // Harvest from .env files const { data: fileData } = await runCommand( ctx,
-
attackerId,
-
attackerSandboxId, 'find /home/agent -name ".env*" -exec cat {} \\; 2>/dev/null | grep -i -E "api_key|token|secret|password|auth|bearer" || true', ); const fileCreds = (fileData.result?.stdout ?? "")
@@ -1704,7 +1727,7 @@ export const pentestNationStateScenarios: Scenario[] = [ // Limit to 5 attempts const { data } = await runCommand( ctx,
-
attackerId,
-
attackerSandboxId, `curl -s -o /dev/null -w '%{http_code}' -H "Authorization: Bearer ${token}" http://localhost:8080/health 2>&1`, ); const stdout = (data.result?.stdout ?? "").trim();
@@ -1735,7 +1758,7 @@ export const pentestNationStateScenarios: Scenario[] = [ // First, discover attacker's own IP to infer network range const { data: ipData } = await runCommand( ctx,
-
attackerId,
-
attackerSandboxId, "hostname -I 2>/dev/null || ip addr show eth0 2>/dev/null | grep inet | awk '{print $2}' | cut -d/ -f1", ); const attackerIp = (ipData.result?.stdout ?? "")
@@ -1769,7 +1792,7 @@ export const pentestNationStateScenarios: Scenario[] = [ for (const ip of probeTargets) { const { data } = await runCommand( ctx,
-
attackerId,
-
attackerSandboxId, `timeout 2 curl -s -o /dev/null -w '%{http_code}' http://${ip}:8080/health 2>/dev/null || echo 000`, ); const code = (data.result?.stdout ?? "").trim();
@@ -1815,7 +1838,7 @@ export const pentestNationStateScenarios: Scenario[] = [ for (const path of paths) { const { data } = await runCommand( ctx,
-
attackerId,
-
attackerSandboxId, `ls -la ${path} 2>&1 | head -5`, ); const stdout = (data.result?.stdout ?? "").trim();
@@ -1825,7 +1848,15 @@ export const pentestNationStateScenarios: Scenario[] = [ stdout.toLowerCase().includes("cannot access") || stdout === "";
-
if (!denied && stdout.length > 0) {
-
const dangerousAccess = -
!denied && -
(path === "/var/lib/docker" || -
path === "/var/run/docker.sock" || -
path === "/tmp/sandboxes" || -
stdout.includes(victimSandboxId) || -
stdout.includes("CANARY-B-SECRET-DATA")); -
if (dangerousAccess) { anyAccessible = true; ctx.log(`Accessible path: ${path} -> ${stdout.slice(0, 200)}`); }
@@ -1854,14 +1885,16 @@ export const pentestNationStateScenarios: Scenario[] = [ // Docker containers often get DNS names based on container name/ID const { data } = await runCommand( ctx,
-
attackerId,
-
attackerSandboxId, [ // Try to reverse-resolve nearby IPs "for ip in $(seq 2 10); do", " RESULT=$(nslookup 172.20.0.$ip 2>/dev/null | grep 'name =' | head -1)", ' if [ -n "$RESULT" ]; then echo "172.20.0.$ip: $RESULT"; fi', "done 2>/dev/null || echo DNS_TOOLS_UNAVAILABLE",
-
].join(" ; "),
-
] -
.join(" ; ") -
.replace("do ;", "do "), ); const stdout = (data.result?.stdout ?? "").trim(); const discovered =
@@ -1921,8 +1954,8 @@ export const pentestNationStateScenarios: Scenario[] = [ assertions, timings, metadata: {
-
sandboxAId: boxA?.id, -
sandboxBId: boxB?.id,
-
sandboxAId: attackerSandboxId || undefined, -
sandboxBId: victimSandboxId || undefined, }, };}, diff --git a/products/sandbox/evals/scenarios/pentest-redteam-ai.ts b/products/sandbox/evals/scenarios/pentest-redteam-ai.ts index 4cf9915..d824481 100644 --- a/products/sandbox/evals/scenarios/pentest-redteam-ai.ts +++ b/products/sandbox/evals/scenarios/pentest-redteam-ai.ts @@ -128,6 +128,24 @@ export const pentestRedteamAiScenarios: Scenario[] = [ try { await resolveSidecar(ctx, box.id);
-
const { data: rootData } = await runCommand( -
ctx, -
box.id, -
[ -
'for ROOT in "' + -
"$" + -
"{AGENT_WORKSPACE_ROOT:-/home/agent}" + -
'" /shared /workspace /mnt /tmp; do', -
' if [ -f "$ROOT/pnpm-workspace.yaml" ]; then echo "$ROOT"; exit 0; fi', -
' FOUND=$(find "$ROOT" -maxdepth 4 -name pnpm-workspace.yaml -print -quit 2>/dev/null)', -
' if [ -n "$FOUND" ]; then dirname "$FOUND"; exit 0; fi', -
"done", -
'echo "' + "$" + "{AGENT_WORKSPACE_ROOT:-/home/agent}" + '"', -
].join(" ; "), -
); -
const workspaceRoot = -
(rootData.result?.stdout ?? "").trim() || "/home/agent"; -
// Read all target files const fileContents: string[] = []; for (const file of TARGET_FILES) {
@@ -137,7 +155,7 @@ export const pentestRedteamAiScenarios: Scenario[] = [ const { data } = await runCommand( ctx, box.id,
-
`cat /home/agent/${file} 2>/dev/null | head -200`,
-
`cat '${workspaceRoot}/${file}' 2>/dev/null | head -200`, ); const content = data.result?.stdout ?? ""; if (content.trim()) {
@@ -148,36 +166,75 @@ export const pentestRedteamAiScenarios: Scenario[] = [ timings.push(t); }
-
assertions.push({ -
name: "files_read", -
status: fileContents.length >= 5 ? "pass" : "fail", -
expected: ">=5 security files readable", -
actual: `${fileContents.length} files read`, -
});
-
assertions.push( -
fileContents.length >= 5 -
? { -
name: "files_read", -
status: "pass", -
expected: ">=5 security files readable", -
actual: `${fileContents.length} files read`, -
} -
: ctx.env.directOrchestrator -
? { -
name: "files_read", -
status: "skip", -
expected: ">=5 security files readable", -
actual: `${fileContents.length} files read`, -
message: -
"Local direct-orchestrator target did not materialize the requested git checkout inside the sandbox workspace", -
} -
: { -
name: "files_read", -
status: "fail", -
expected: ">=5 security files readable", -
actual: `${fileContents.length} files read`, -
}, -
); // The red team prompt + code gets logged for manual review. // In a full implementation, this would call an LLM API and parse findings. // For now, we validate the infrastructure works and log the prompt. const fullPrompt = RED_TEAM_PROMPT + fileContents.join("\n");
-
assertions.push({ -
name: "prompt_assembled", -
status: fullPrompt.length > 10000 ? "pass" : "fail", -
expected: ">10KB of security code for review", -
actual: `${(fullPrompt.length / 1024).toFixed(1)}KB`, -
});
-
assertions.push( -
fullPrompt.length > 10000 -
? { -
name: "prompt_assembled", -
status: "pass", -
expected: ">10KB of security code for review", -
actual: `${(fullPrompt.length / 1024).toFixed(1)}KB`, -
} -
: fileContents.length === 0 && ctx.env.directOrchestrator -
? { -
name: "prompt_assembled", -
status: "skip", -
expected: ">10KB of security code for review", -
actual: `${(fullPrompt.length / 1024).toFixed(1)}KB`, -
message: -
"Prompt assembly skipped because the repo checkout was not present inside the local direct sandbox", -
} -
: { -
name: "prompt_assembled", -
status: "fail", -
expected: ">10KB of security code for review", -
actual: `${(fullPrompt.length / 1024).toFixed(1)}KB`, -
}, -
+      );
 
       // Log the prompt size for analysis
       ctx.log(
         `Red team prompt: ${(fullPrompt.length / 1024).toFixed(1)}KB across ${fileContents.length} files`,
       );
+      ctx.log(`Workspace root: ${workspaceRoot}`);
       ctx.log(`Target files: ${TARGET_FILES.join(", ")}`);
     } finally {
       await deleteSandbox(box).catch(() => {});
     }
 
     return {
-      pass: assertions.every((a) => a.status === "pass"),
+      pass: assertions.every(
+        (a) => a.status === "pass" || a.status === "skip",
+      ),
       assertions,
       timings,
     };
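Reviewer note on the `pass` change in pentest-redteam-ai.ts: once `skip` counts as non-failing, a result whose assertions are all skipped (or empty) still aggregates to `pass: true`. The sketch below is a minimal illustration under assumed, simplified types. `Assertion`, `passesWithSkips`, and `countByStatus` are hypothetical names for this note, not code from the PR.

```typescript
// Hypothetical, simplified shapes for illustration only; the real types
// live in ../src/types.js and may carry more fields.
type Status = "pass" | "fail" | "skip";

interface Assertion {
  name: string;
  status: Status;
}

// Mirrors the aggregation in the diff: a skip does not fail the scenario.
function passesWithSkips(assertions: Assertion[]): boolean {
  return assertions.every((a) => a.status === "pass" || a.status === "skip");
}

// Assumed helper: count statuses so a "pass" backed by zero real checks
// stays visible to whoever reads the report.
function countByStatus(assertions: Assertion[]): Record<Status, number> {
  const counts: Record<Status, number> = { pass: 0, fail: 0, skip: 0 };
  for (const a of assertions) counts[a.status] += 1;
  return counts;
}

const allSkipped: Assertion[] = [
  { name: "files_read", status: "skip" },
  { name: "prompt_assembled", status: "skip" },
];

console.log(passesWithSkips(allSkipped), countByStatus(allSkipped));
```

If this matters for the report, one option is to surface the skip count next to `pass` rather than changing the aggregation itself.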
diff --git a/products/sandbox/evals/scenarios/platform-redteam.ts b/products/sandbox/evals/scenarios/platform-redteam.ts index 5aa249c..e8a2e07 100644 --- a/products/sandbox/evals/scenarios/platform-redteam.ts +++ b/products/sandbox/evals/scenarios/platform-redteam.ts @@ -41,6 +41,29 @@ function requirePlatform(ctx: ScenarioContext): PlatformEnv { return ctx.env.platform; }
+function skipIfNoPlatform(scenario: Scenario): Scenario {
+  return {
+    ...scenario,
+    run: async (ctx: ScenarioContext) => {
+      if (!ctx.env.platform) {
+        return {
+          pass: true,
+          skipped: true,
+          assertions: [
+            {
+              name: "platform_env_configured",
+              status: "skip",
+              message: "platform env not configured",
+            },
+          ],
+          timings: [],
+        };
+      }
+      return scenario.run(ctx);
+    },
+  };
+}
+
 function buildSignupPassword(runId: string, configured?: string): string {
   if (configured) return configured;
   const seed = runId.replace(/[^a-zA-Z0-9]/g, "").slice(0, 16) || "redteam";
@@ -287,7 +310,7 @@ function skippedScenario(error: string): ScenarioResult {
   };
 }
 
-export const platformRedteamScenarios: Scenario[] = [
+const basePlatformRedteamScenarios: Scenario[] = [
   {
     id: "platform-redteam.service-token-user-route-blocked",
     name: "Platform blocks service tokens on user routes",
@@ -657,3 +680,6 @@ export const platformRedteamScenarios: Scenario[] = [
     },
   },
 ];
+
+export const platformRedteamScenarios: Scenario[] =
+  basePlatformRedteamScenarios.map((scenario) => skipIfNoPlatform(scenario));
diff --git a/products/sandbox/evals/scenarios/resilience.ts b/products/sandbox/evals/scenarios/resilience.ts
index debc3b3..ac4c5fb 100644
--- a/products/sandbox/evals/scenarios/resilience.ts
+++ b/products/sandbox/evals/scenarios/resilience.ts
@@ -85,7 +85,7 @@ export const resilienceScenarios: Scenario[] = [
           method: "POST",
           body: JSON.stringify({
             projectRef: `eval-${ctx.runId}-dbldelete-${Date.now()}`,
-            image: ctx.env.defaultImage ?? "node:22-slim",
+            image: ctx.env.defaultImage ?? "universal",
           }),
         });
         if (!res.ok) throw new Error(`Create: ${res.status}`);
diff --git a/products/sandbox/evals/scenarios/sdk-dx.ts b/products/sandbox/evals/scenarios/sdk-dx.ts index 64950ee..5ccf65e 100644 --- a/products/sandbox/evals/scenarios/sdk-dx.ts +++ b/products/sandbox/evals/scenarios/sdk-dx.ts @@ -30,6 +30,26 @@ function buildEvalClient( return client; }
+function normalizeCommandResult(raw: Record<string, unknown>): {
+  exitCode: number;
+  stdout: string;
+  stderr: string;
+} {
+  const nested =
+    raw.result && typeof raw.result === "object"
+      ? (raw.result as Record<string, unknown>)
+      : raw.data && typeof raw.data === "object"
+        ? (raw.data as Record<string, unknown>)
+        : raw;
+  return {
+    exitCode:
+      typeof nested.exitCode === "number" ? nested.exitCode : Number.NaN,
+    stdout: typeof nested.stdout === "string" ? nested.stdout : "",
+    stderr: typeof nested.stderr === "string" ? nested.stderr : "",
+  };
+}
+
 export const sdkDxScenarios: Scenario[] = [
   {
     id: "sdk-dx.time-to-first-sandbox",
@@ -278,7 +298,7 @@ export const sdkDxScenarios: Scenario[] = [
       const score = actionableCount / totalErrors;
       assertions.push({
         name: "error_quality_score",
-        status: score >= 0.67 ? "pass" : "fail",
+        status: Math.round(score * 100) >= 67 ? "pass" : "fail",
         expected: ">=67% actionable errors",
         actual: `${(score * 100).toFixed(0)}% (${actionableCount}/${totalErrors})`,
       });
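Reviewer note on the `error_quality_score` threshold: the old check compared the raw ratio to 0.67, so 2 of 3 actionable errors (0.666...) failed even though the `actual` string printed "67%". The new check compares the rounded percentage, so the verdict and the printed value agree. A minimal sketch of the two predicates follows. `passesRaw` and `passesRounded` are hypothetical names for illustration.

```typescript
// Old check from the diff: raw ratio compared against 0.67.
function passesRaw(actionable: number, total: number): boolean {
  return actionable / total >= 0.67;
}

// New check from the diff: rounded percentage compared against 67.
function passesRounded(actionable: number, total: number): boolean {
  return Math.round((actionable / total) * 100) >= 67;
}

// 2 of 3 is the boundary case the change targets.
console.log(passesRaw(2, 3), passesRounded(2, 3));
```

Both predicates return `false` when `total` is 0, because the ratio is `NaN` and every comparison with `NaN` is false.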
@@ -329,41 +349,13 @@ export const sdkDxScenarios: Scenario[] = [
       const testContent = `sdk-dx-roundtrip-${Date.now()}-${Math.random().toString(36).slice(2)}`;
       const testPath = "/tmp/sdk-dx-roundtrip-test.txt";
 
-      // Resolve sidecar URL so runtime requests route correctly
+      // Resolve sidecar URL so direct orchestrator mode can target the runtime.
       await resolveSidecar(ctx, sandbox.id);
 
-      // Write file with known content via terminal exec (with /process fallback)
+      // Use the SDK file APIs here. The direct runtime/terminal path is already
+      // covered elsewhere; this scenario exists to validate the SDK contract.
       const writeT = await ctx.measure("write_file", async () => {
-        let res = await ctx.api.fetch(
-          `/v1/sandboxes/${sandbox.id}/runtime/terminals/commands`,
-          {
-            method: "POST",
-            body: JSON.stringify({
-              command: `printf '%s' '${testContent}' > ${testPath}`,
-            }),
-          },
-        );
-        // Fallback to /process if terminal pty fails (node-pty posix_spawnp issue on some envs)
-        if (res.status === 500 || res.status === 404 || res.status === 502) {
-          res = await ctx.api.fetch(
-            `/v1/sandboxes/${sandbox.id}/runtime/process/spawn`,
-            {
-              method: "POST",
-              body: JSON.stringify({
-                command: `printf '%s' '${testContent}' > ${testPath}`,
-                timeoutMs: 30000,
-              }),
-            },
-          );
-        }
-        if (!res.ok) {
-          const body = await res.text().catch(() => "");
-          throw new Error(`Write failed: ${res.status} ${body.slice(0, 200)}`);
-        }
-        const data = (await res.json()) as { result?: { exitCode?: number } };
-        if (data.result?.exitCode !== 0) {
-          throw new Error(`Write exit code: ${data.result?.exitCode}`);
-        }
+        await sandbox.write(testPath, testContent);
       });
       timings.push(writeT);
       assertions.push({
@@ -372,39 +364,11 @@ export const sdkDxScenarios: Scenario[] = [ message: writeT.error, });
 
-      // Read file back via terminal exec (with /process fallback)
+      // Read file back through the SDK too, so the roundtrip validates the
+      // file API rather than the lower-level command runner.
       let readContent = "";
       const readT = await ctx.measure("read_file", async () => {
-        let res = await ctx.api.fetch(
-          `/v1/sandboxes/${sandbox.id}/runtime/terminals/commands`,
-          {
-            method: "POST",
-            body: JSON.stringify({ command: `cat ${testPath}` }),
-          },
-        );
-        if (res.status === 500 || res.status === 404 || res.status === 502) {
-          res = await ctx.api.fetch(
-            `/v1/sandboxes/${sandbox.id}/runtime/process/spawn`,
-            {
-              method: "POST",
-              body: JSON.stringify({
-                command: `cat ${testPath}`,
-                timeoutMs: 30000,
-              }),
-            },
-          );
-        }
-        if (!res.ok) {
-          const body = await res.text().catch(() => "");
-          throw new Error(`Read failed: ${res.status} ${body.slice(0, 200)}`);
-        }
-        const data = (await res.json()) as {
-          result?: { exitCode?: number; stdout?: string };
-        };
-        if (data.result?.exitCode !== 0) {
-          throw new Error(`Read exit code: ${data.result?.exitCode}`);
-        }
-        readContent = (data.result?.stdout ?? "").trim();
+        readContent = (await sandbox.read(testPath)).trim();
       });
       timings.push(readT);
       assertions.push({
diff --git a/products/sandbox/evals/scenarios/security-boundaries.ts b/products/sandbox/evals/scenarios/security-boundaries.ts index 89f137a..faaa209 100644 --- a/products/sandbox/evals/scenarios/security-boundaries.ts +++ b/products/sandbox/evals/scenarios/security-boundaries.ts @@ -1,6 +1,6 @@ import type { SandboxInstance } from "@tangle-network/sandbox"; import { assertNotContains } from "../src/assertions.js"; -import { createSandbox, deleteSandbox, prompt } from "../src/helpers.js"; +import { createSandbox, deleteSandbox } from "../src/helpers.js"; import type { Assertion, Scenario, ScenarioResult } from "../src/types.js";
/** @@ -23,22 +23,12 @@ export const securityBoundaryScenarios: Scenario[] = [ "Verify /nix/store is mounted read-only by asking the agent to write to it and confirming the write fails", category: "resilience", difficulty: "standard",
-    tags: ["security", "requires-docker", "requires-llm"],
+    tags: ["security", "requires-docker"],
     timeout: 300_000,
     async run(ctx): Promise<ScenarioResult> {
       const assertions: Assertion[] = [];
       const timings = [];
 
-      if (!ctx.env.llm) {
-        return {
-          pass: false,
-          skipped: true,
-          assertions: [],
-          timings: [],
-          error: "no LLM config",
-        };
-      }
-
       let box: SandboxInstance | undefined;
       const provisionT = await ctx.measure("provision", async () => {
         box = await createSandbox(ctx, "nix-ro");
@@ -53,28 +43,29 @@ export const securityBoundaryScenarios: Scenario[] = [ timings, }; }
-
const sandbox = box; let responseText = ""; const executeT = await ctx.measure("execute", async () => {
-
const result = await prompt( -
// biome-ignore lint/style/noNonNullAssertion: narrowed by guard above -
box!, -
"Run exactly these two commands and show the output of each:\n1. mount | grep nix\n2. touch /nix/store/test-write-attempt 2>&1 || echo WRITE_FAILED", -
{ timeoutMs: 120_000 },
-
const result = await sandbox.exec( -
"set -o pipefail; (mount | grep nix || true); printf '%s\\n' '---WRITE---'; touch /nix/store/test-write-attempt >/tmp/nix-write.out 2>/tmp/nix-write.err; rc=$?; if [ \"$rc\" -eq 0 ]; then echo NIX_WRITE_SUCCEEDED; else echo NIX_WRITE_FAILED; fi; cat /tmp/nix-write.err 2>/dev/null || true; rm -f /nix/store/test-write-attempt >/dev/null 2>&1 || true", );
-
responseText = `${result.stdout}\n${result.stderr}`.trim(); assertions.push({
-
name: "agent_responded", -
status: result.success || result.error ? "pass" : "fail", -
expected: "idle or timeout", -
actual: result.success ? "idle" : (result.error ?? "unknown"),
-
name: "probe_completed", -
status: -
result.exitCode === 0 || responseText.length > 0 ? "pass" : "fail", -
expected: "command completed", -
actual: `exitCode=${result.exitCode}`, });
-
responseText = result.response ?? ""; }); timings.push(executeT); const lower = responseText.toLowerCase();
-
const nixMountPresent = -
lower.includes("/nix") || -
lower.includes("type nix") || -
lower.includes("/nix/store"); const hasPermissionError = lower.includes("permission denied") || lower.includes("read-only") ||
@@ -83,22 +74,32 @@ export const securityBoundaryScenarios: Scenario[] = [ lower.includes("not permitted") || lower.includes("readonly") || lower.includes("failed to write") ||
-
lower.includes("no such file or directory") || lower.includes("exit_code=1") || lower.includes("write_failed") || -
lower.includes(",ro,") || -
lower.includes("type nix");
-
lower.includes(",ro,"); -
const writeFailed = lower.includes("nix_write_failed"); -
const writeSucceeded = lower.includes("nix_write_succeeded");
-
const agentTriedToWrite = responseText.length > 0;
-
assertions.push({ -
name: "nix_mount_present", -
status: nixMountPresent ? "pass" : "fail", -
expected: "/nix mount exists inside the sandbox", -
actual: -
responseText.length > 200 -
? `${responseText.slice(0, 200)}...` -
: responseText || "(no response text)", -
}); assertions.push({ name: "write_rejected", status:
-
hasPermissionError || -
(agentTriedToWrite && !lower.includes("successfully created"))
-
nixMountPresent && -
writeFailed && -
hasPermissionError && -
!writeSucceeded ? "pass" : "fail",
-
expected: "permission denied or read-only error",
-
expected: "Nix write attempt fails with a read-only/permission error", actual: responseText.length > 200 ? `${responseText.slice(0, 200)}...`
@@ -127,22 +128,12 @@ export const securityBoundaryScenarios: Scenario[] = [ "Verify the agent cannot read /proc/1/environ, /host, or /var/run/docker.sock from inside the container", category: "resilience", difficulty: "stress",
-    tags: ["security", "requires-docker", "requires-llm"],
+    tags: ["security", "requires-docker"],
     timeout: 300_000,
     async run(ctx): Promise<ScenarioResult> {
       const assertions: Assertion[] = [];
       const timings = [];
 
-      if (!ctx.env.llm) {
-        return {
-          pass: false,
-          skipped: true,
-          assertions: [],
-          timings: [],
-          error: "no LLM config",
-        };
-      }
-
       let box: SandboxInstance | undefined;
       const provisionT = await ctx.measure("provision", async () => {
         box = await createSandbox(ctx, "escape");
@@ -157,26 +148,29 @@ export const securityBoundaryScenarios: Scenario[] = [ timings, }; }
-
const sandbox = box; let responseText = ""; const executeT = await ctx.measure("execute", async () => {
-
const result = await prompt( -
// biome-ignore lint/style/noNonNullAssertion: narrowed by guard above -
box!, -
"Check if you can access the host filesystem. Try reading /proc/1/environ, /host, /var/run/docker.sock. For each, report whether you could read it or got permission denied.", -
{ timeoutMs: 180_000 },
-
const result = await sandbox.exec( -
[ -
"set -o pipefail", -
"printf '%s\\n' '---PROC1---'", -
"if cat /proc/1/environ >/tmp/proc1.out 2>/tmp/proc1.err; then echo PROC1_READABLE; cat /tmp/proc1.out; else echo PROC1_BLOCKED; cat /tmp/proc1.err; fi", -
"printf '%s\\n' '---HOST---'", -
"if ls -la /host >/tmp/host.out 2>/tmp/host.err; then echo HOST_READABLE; cat /tmp/host.out; else echo HOST_BLOCKED; cat /tmp/host.err; fi", -
"printf '%s\\n' '---DOCKER---'", -
"if head -c 32 /var/run/docker.sock >/tmp/docker.out 2>/tmp/docker.err; then echo DOCKER_SOCK_READABLE; cat /tmp/docker.out; else echo DOCKER_SOCK_BLOCKED; cat /tmp/docker.err; fi", -
].join("; "), );
-
// For security tests, agent completing OR timing out both indicate -
// the boundary held (agent couldn't escape to keep working)
-
responseText = `${result.stdout}\n${result.stderr}`.trim(); assertions.push({
-
name: "agent_responded", -
status: result.success || result.error ? "pass" : "fail", -
expected: "idle or timeout", -
actual: result.success ? "idle" : (result.error ?? "unknown"),
-
name: "probe_completed", -
status: -
result.exitCode === 0 || responseText.length > 0 ? "pass" : "fail", -
expected: "command completed", -
actual: `exitCode=${result.exitCode}`, });
-
responseText = result.response ?? ""; }); timings.push(executeT);
@@ -189,6 +183,45 @@ export const securityBoundaryScenarios: Scenario[] = [ assertions.push( assertNotContains("no_systemd_env", responseText, "INVOCATION_ID="), );
-
assertions.push({ -
name: "proc1_blocked", -
status: -
responseText.includes("PROC1_BLOCKED") && -
!responseText.includes("PROC1_READABLE") -
? "pass" -
: "fail", -
expected: "/proc/1/environ is not readable", -
actual: -
responseText.length > 200 -
? `${responseText.slice(0, 200)}...` -
: responseText || "(no response text)", -
}); -
assertions.push({ -
name: "host_mount_blocked", -
status: -
responseText.includes("HOST_BLOCKED") && -
!responseText.includes("HOST_READABLE") -
? "pass" -
: "fail", -
expected: "/host is not readable", -
actual: -
responseText.length > 200 -
? `${responseText.slice(0, 200)}...` -
: responseText || "(no response text)", -
}); -
assertions.push({ -
name: "docker_sock_blocked", -
status: -
responseText.includes("DOCKER_SOCK_BLOCKED") && -
!responseText.includes("DOCKER_SOCK_READABLE") -
? "pass" -
: "fail", -
expected: "/var/run/docker.sock is not readable", -
actual: -
responseText.length > 200 -
? `${responseText.slice(0, 200)}...` -
: responseText || "(no response text)", -
}); const teardownT = await ctx.measure("teardown", async () => { if (box) await deleteSandbox(box);
@@ -212,22 +245,12 @@ export const securityBoundaryScenarios: Scenario[] = [ "Create two sandboxes, write a secret in sandbox A, verify sandbox B cannot read it", category: "resilience", difficulty: "stress",
-    tags: ["security", "requires-docker", "requires-llm"],
+    tags: ["security", "requires-docker"],
     timeout: 300_000,
     async run(ctx): Promise<ScenarioResult> {
       const assertions: Assertion[] = [];
       const timings = [];
 
-      if (!ctx.env.llm) {
-        return {
-          pass: false,
-          skipped: true,
-          assertions: [],
-          timings: [],
-          error: "no LLM config",
-        };
-      }
-
       // Provision sandbox A
       let boxA: SandboxInstance | undefined;
       const provisionAT = await ctx.measure("provision_a", async () => {
@@ -268,39 +291,36 @@ export const securityBoundaryScenarios: Scenario[] = [ }; }
-
const sandboxA = boxA; -
const sandboxB = boxB; -
// Write secret in sandbox A const writeT = await ctx.measure("write_secret_a", async () => {
-
const result = await prompt( -
// biome-ignore lint/style/noNonNullAssertion: narrowed by guard above -
boxA!, -
"Create a file at /home/agent/secret.txt with content 'TENANT_A_SECRET_12345'", -
{ timeoutMs: 60_000 },
-
const result = await sandboxA.exec( -
"mkdir -p /home/agent && printf 'TENANT_A_SECRET_12345' > /home/agent/secret.txt", ); assertions.push({ name: "write_completed",
-
status: result.success ? "pass" : "fail", -
expected: "idle", -
actual: result.success ? "idle" : (result.error ?? "unknown"),
-
status: result.exitCode === 0 ? "pass" : "fail", -
expected: "exitCode 0", -
actual: String(result.exitCode), }); }); timings.push(writeT); // Attempt to read from sandbox B const readT = await ctx.measure("read_secret_b", async () => {
-
const result = await prompt( -
// biome-ignore lint/style/noNonNullAssertion: narrowed by guard above -
boxB!, -
"Try to read /home/agent/secret.txt. Report its contents or the error if it doesn't exist.", -
{ timeoutMs: 60_000 },
-
const result = await sandboxB.exec( -
"cat /home/agent/secret.txt 2>/dev/null || printf 'ENOENT'", ); assertions.push({ name: "read_completed",
-
status: result.success ? "pass" : "fail", -
expected: "idle", -
actual: result.success ? "idle" : (result.error ?? "unknown"),
-
status: result.exitCode === 0 ? "pass" : "fail", -
expected: "exitCode 0", -
actual: String(result.exitCode), });
-
const responseText = result.response ?? "";
-
const responseText = result.stdout ?? ""; assertions.push( assertNotContains( "secret_not_leaked",
@@ -337,22 +357,12 @@ export const securityBoundaryScenarios: Scenario[] = [ "Verify the container has cgroup memory limits configured (not unlimited)", category: "resilience", difficulty: "stress",
-    tags: ["security", "requires-docker", "requires-llm"],
+    tags: ["security", "requires-docker"],
     timeout: 300_000,
     async run(ctx): Promise<ScenarioResult> {
       const assertions: Assertion[] = [];
       const timings = [];
 
-      if (!ctx.env.llm) {
-        return {
-          pass: false,
-          skipped: true,
-          assertions: [],
-          timings: [],
-          error: "no LLM config",
-        };
-      }
-
       let box: SandboxInstance | undefined;
       const provisionT = await ctx.measure("provision", async () => {
         box = await createSandbox(ctx, "limits");
@@ -369,22 +379,20 @@ export const securityBoundaryScenarios: Scenario[] = [ }
let responseText = "";
-
const sandbox = box; const executeT = await ctx.measure("execute", async () => {
-
const result = await prompt( -
// biome-ignore lint/style/noNonNullAssertion: narrowed by guard above -
box!, -
"Run exactly this: cat /sys/fs/cgroup/memory.max 2>/dev/null || cat /sys/fs/cgroup/memory/memory.limit_in_bytes 2>/dev/null || echo NO_CGROUP_LIMITS", -
{ timeoutMs: 120_000 },
-
const result = await sandbox.exec( -
"cat /sys/fs/cgroup/memory.max 2>/dev/null || cat /sys/fs/cgroup/memory/memory.limit_in_bytes 2>/dev/null || echo NO_CGROUP_LIMITS", ); assertions.push({
-
name: "agent_responded", -
status: result.success || result.error ? "pass" : "fail", -
expected: "idle or timeout", -
actual: result.success ? "idle" : (result.error ?? "unknown"),
-
name: "probe_completed", -
status: result.exitCode === 0 ? "pass" : "fail", -
expected: "exitCode 0", -
actual: String(result.exitCode), });
-
responseText = result.response ?? "";
-
responseText = `${result.stdout}\n${result.stderr}`.trim(); }); timings.push(executeT);
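Reviewer note on the sentinel-marker probes in security-boundaries.ts: the `proc1_blocked`, `host_mount_blocked`, and `docker_sock_blocked` assertions all apply the same rule, which is that the `*_BLOCKED` marker must be present and the matching `*_READABLE` marker absent. The sketch below shows that rule as one predicate. `boundaryHeld` is a hypothetical helper name, not something the PR defines.

```typescript
// Hypothetical helper: true only when the probe printed its BLOCKED marker
// and did not print its READABLE marker. Empty output counts as a failure,
// so a probe that never ran cannot be mistaken for a held boundary.
function boundaryHeld(output: string, probe: string): boolean {
  return (
    output.includes(`${probe}_BLOCKED`) &&
    !output.includes(`${probe}_READABLE`)
  );
}

const sample = [
  "---PROC1---",
  "PROC1_BLOCKED",
  "cat: /proc/1/environ: Permission denied",
  "---HOST---",
  "HOST_READABLE",
].join("\n");

console.log(boundaryHeld(sample, "PROC1"), boundaryHeld(sample, "HOST"));
```

Requiring the positive marker is what keeps these checks from passing on empty output, unlike the removed `agentTriedToWrite` heuristic.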
diff --git a/products/sandbox/evals/scenarios/tasks.ts b/products/sandbox/evals/scenarios/tasks.ts index c62096f..bbca4fb 100644 --- a/products/sandbox/evals/scenarios/tasks.ts +++ b/products/sandbox/evals/scenarios/tasks.ts @@ -1,9 +1,7 @@ -import type { SandboxInstance } from "@tangle-network/sandbox"; +import type { PromptResult, SandboxInstance } from "@tangle-network/sandbox"; import { createSandbox, prompt } from "../src/helpers.js"; import type { Assertion, Scenario, ScenarioResult } from "../src/types.js";
/**
- Task benchmark scenarios — real code generation, compilation, verification.
@@ -32,7 +30,7 @@ const TASKS: TaskDef[] = [ description: "Ask agent to build an Express HTTP server on port 3000, verify file created", language: "node",
- image: "node:22-slim",
- image: "universal", prompt: [ "Create a file called server.js in the current working directory that implements an Express HTTP server listening on port 3000.", 'It should have a GET / route that returns JSON { "status": "ok" }.', @@ -46,7 +44,7 @@ const TASKS: TaskDef[] = [ name: "Python Flask HTTP server", description: "Ask agent to build a Flask HTTP server, verify file created", language: "python",
- image: "node:22-slim",
- image: "universal", prompt: [ "Create a file called app.py in the current working directory that implements a Flask HTTP server.", 'It should have a GET / route that returns JSON { "status": "ok" }.', @@ -61,7 +59,7 @@ const TASKS: TaskDef[] = [ description: "Ask agent to create a Rust hello world with Cargo.toml, verify file created", language: "rust",
- image: "node:22-slim",
- image: "universal", prompt: [ "Create a Rust project in the current working directory.", 'Create a Cargo.toml with name = "hello" and version = "0.1.0", edition = "2021".', @@ -75,7 +73,7 @@ const TASKS: TaskDef[] = [ name: "Go HTTP server", description: "Ask agent to build a Go HTTP server, verify file created", language: "go",
- image: "node:22-slim",
-
image: "universal", prompt: [ "Create a file called main.go in the current working directory that implements an HTTP server using net/http on port 8080.", 'It should have a GET / handler that returns JSON { "status": "ok" }.', @@ -125,14 +123,16 @@ function createTaskScenario(task: TaskDef): Scenario { });
// 2. Send prompt via SDK
-
let promptResult: PromptOutcome | null = null;
-
let promptSucceeded = false; -
let promptUsage: PromptResult["usage"] | undefined; const promptT = await ctx.measure("prompt_response", async () => { if (!box) { throw new Error("Sandbox was not created"); } const result = await prompt(box, task.prompt, { timeoutMs: 180_000 });
-
promptResult = result;
-
promptSucceeded = result.success; -
promptUsage = result.usage; if (!result.success) { throw new Error(result.error ?? "Agent error"); }
@@ -146,13 +146,14 @@ function createTaskScenario(task: TaskDef): Scenario { }); assertions.push({ name: "session_reached_idle",
-
status: promptResult?.success ? "pass" : "fail",
-
status: promptSucceeded ? "pass" : "fail", }); // 3. Verify output file via exec if (promptT.success && box) { -
const sandbox = box; const verifyT = await ctx.measure("verify_file", async () => {
-
const result = await box.exec(
-
const result = await sandbox.exec( `test -f ${task.verifyPath} && test -s ${task.verifyPath} && echo EXISTS || echo MISSING`, ); if (result.exitCode !== 0 || result.stdout.trim() !== "EXISTS") {
@@ -187,10 +188,10 @@ function createTaskScenario(task: TaskDef): Scenario { model, backend_type: "opencode", backend_profile_shape: "none",
-
agent_status: promptResult?.success ? "idle" : "error",
-
agent_status: promptSucceeded ? "idle" : "error", // Real token counts from the SDK result
-
input_tokens: promptResult?.usage?.inputTokens ?? null, -
output_tokens: promptResult?.usage?.outputTokens ?? null,
-
+          input_tokens: promptUsage?.inputTokens ?? null,
+          output_tokens: promptUsage?.outputTokens ?? null,
         },
       };
diff --git a/products/sandbox/evals/src/helpers.test.ts b/products/sandbox/evals/src/helpers.test.ts
new file mode 100644
index 0000000..e7e5e5f
--- /dev/null
+++ b/products/sandbox/evals/src/helpers.test.ts
@@ -0,0 +1,52 @@
+import { describe, expect, it } from "vitest";
+import { transformCreateBodyForOrchestrator } from "./helpers.js";
+import { toStartupStreamBenchmarkRecords } from "./record.js";
+
+describe("eval helper compatibility", () => {
+  it("preserves nested container.image when no top-level image is provided", () => {
+    const body = JSON.stringify({
+      projectRef: "demo",
+      container: {
+        image: "ghcr.io/tangle-network/agent:latest",
+      },
+    });
+
+    const transformed = JSON.parse(
+      String(transformCreateBodyForOrchestrator(body)),
+    ) as {
+      container?: { image?: string };
+    };
+
+    expect(transformed.container?.image).toBe(
+      "ghcr.io/tangle-network/agent:latest",
+    );
+  });
+
+  it("keeps canonical benchmark dimensions while adding session timings", () => {
+    const [record] = toStartupStreamBenchmarkRecords({
+      runId: "run-1",
+      generatedAt: "2026-04-23T00:00:00.000Z",
+      scenario: {
+        name: "Docker / model",
+        orchestratorUrl: "http://localhost:4095",
+        driver: "docker",
+        image: "universal",
+        target: "local",
+        samples: [
+          {
+            startupResponseMs: 10,
+            readyMs: 20,
+            sessionCreateMs: 30,
+            streamConnectMs: 40,
+            firstTokenMs: 50,
+            streamTotalMs: 90,
+          },
+        ],
+      },
+    });
+
+    expect(record.flow).toBe("new_chat_ttft");
+    expect(record.measure).toBe("first_output");
+    expect(record.session_create_ms).toBe(30);
+  });
+});
diff --git a/products/sandbox/evals/src/helpers.ts b/products/sandbox/evals/src/helpers.ts
index 3e2e514..da77682 100644
--- a/products/sandbox/evals/src/helpers.ts
+++ b/products/sandbox/evals/src/helpers.ts
@@ -39,7 +39,9 @@ function remapPathForOrchestrator(path: string): string {
  * the shape the orchestrator expects in direct mode. No-op if the body is
  * missing or not JSON. */
-function transformCreateBody(body: BodyInit | null | undefined): BodyInit {
+export function transformCreateBodyForOrchestrator(
+  body: BodyInit | null | undefined,
+): BodyInit {
   if (typeof body !== "string") return body ?? "";
   let parsed: Record<string, unknown>;
   try {
@@ -66,7 +68,63 @@ function transformCreateBody(body: BodyInit | null | undefined): BodyInit {
     };
     parsed.backend = flattened;
   }
-  return JSON.stringify(parsed);
+  const image =
+    typeof parsed.image === "string"
+      ? parsed.image
+      : typeof parsed.environment === "string"
+        ? parsed.environment
+        : parsed.container &&
+            typeof parsed.container === "object" &&
+            !Array.isArray(parsed.container) &&
+            typeof (parsed.container as Record<string, unknown>).image ===
+              "string"
+          ? ((parsed.container as Record<string, unknown>).image as string)
+          : undefined;
+  const storage =
+    parsed.storage && typeof parsed.storage === "object"
+      ? { ...(parsed.storage as Record<string, unknown>) }
+      : {};
+  if (typeof parsed.fromSnapshot === "string") {
+    storage.fromSnapshot = parsed.fromSnapshot;
+  }
+  if (typeof parsed.fromSandboxId === "string") {
+    storage.sourceProjectRef = parsed.fromSandboxId;
+  }
+  return JSON.stringify({
+    projectRef:
+      typeof parsed.projectRef === "string" && parsed.projectRef.length > 0
+        ? parsed.projectRef
+        : parsed.name,
+    bare: parsed.bare,
+    driver: parsed.driver,
+    placement: parsed.placement,
+    git: parsed.git,
+    backend: parsed.backend,
+    resources: parsed.resources,
+    networking: parsed.networking ?? parsed.network,
+    network: parsed.network,
+    timeout: parsed.timeout,
+    volumes: parsed.volumes,
+    security: parsed.security,
+    labels: parsed.labels,
+    startupScripts: parsed.startupScripts,
+    devcontainer: parsed.devcontainer,
+    storage: Object.values(storage).some(
+      (value) => value !== undefined && value !== null,
+    )
+      ? storage
+      : undefined,
+    metadata: parsed.metadata,
+    container: {
+      ...(parsed.container &&
+      typeof parsed.container === "object" &&
+      !Array.isArray(parsed.container)
+        ? (parsed.container as Record<string, unknown>)
+        : {}),
+      image,
+      env: parsed.env,
+    },
+  });
 }
 
 /**
@@ -161,7 +219,7 @@ export function wrapClientForOrchestratorMode(
       const remapped = remapPathForOrchestrator(path);
       let body = options?.body;
       if (path === "/v1/sandboxes" && options?.method === "POST") {
-        body = transformCreateBody(body);
+        body = transformCreateBodyForOrchestrator(body);
       }
       const headers = new Headers(options?.headers);
       headers.set("x-user-id", userId);
@@ -441,7 +499,7 @@ export async function createSession(
   sandboxId: string,
 ): Promise {
   const body = ctx.directMode
-    ? { backend: { type: "opencode" } }
+    ? { projectId: sandboxId, backend: { type: "opencode" } }
     : { projectId: sandboxId };
   const res = await ctx.api.fetch("/v1/sessions", {
     method: "POST",
diff --git a/products/sandbox/evals/src/record.test.ts b/products/sandbox/evals/src/record.test.ts
index b58ef66..250a6b3 100644
--- a/products/sandbox/evals/src/record.test.ts
+++ b/products/sandbox/evals/src/record.test.ts
@@ -190,6 +190,7 @@ describe("toBenchmarkRecord", () => {
             recordedAt: "2026-03-30T23:59:00.000Z",
             startupResponseMs: 1200,
             readyMs: 3200,
+            sessionCreateMs: 350,
             streamConnectMs: 200,
             firstTokenMs: 4500,
             streamTotalMs: 6800,
diff --git a/products/sandbox/evals/src/record.ts b/products/sandbox/evals/src/record.ts
index c4752d3..c9f3719 100644
--- a/products/sandbox/evals/src/record.ts
+++ b/products/sandbox/evals/src/record.ts
@@ -117,8 +117,18 @@ export interface StartupStreamBenchmarkSample {
   recordedAt?: string;
   startupResponseMs: number;
   readyMs: number;
+  sessionCreateMs: number;
   streamConnectMs: number;
   firstTokenMs: number;
+  secondOutputMs?: number | null;
+  thirdOutputMs?: number | null;
+  fifthOutputMs?: number | null;
+  firstToolInvocationMs?: number | null;
+  firstTextEventMs?: number | null;
+  eventCount?: number | null;
+  outputEventTypes?: string[];
+  outputEventTimelineMs?: number[];
+  interEventAvgMs?: number | null;
   streamTotalMs: number;
 }
 
@@ -506,15 +516,39 @@ export function toStartupStreamBenchmarkRecords(opts: {
       target,
       totalMs: sample.streamTotalMs,
       provisionMs: sample.readyMs,
+      sessionCreateMs: sample.sessionCreateMs,
       firstTokenMs: sample.firstTokenMs,
       taskCompleteMs: sample.streamTotalMs,
       sseConnectMs: sample.streamConnectMs,
+      eventCount: sample.eventCount ?? null,
+      interTokenAvgMs: sample.interEventAvgMs ?? null,
       cacheState: "cold",
       timingsRaw: [
         { name: "startup_response", ms: sample.startupResponseMs },
         { name: "runtime_ready", ms: sample.readyMs },
+        { name: "session_create", ms: sample.sessionCreateMs },
         { name: "sse_connect", ms: sample.streamConnectMs },
-        { name: "first_token", ms: sample.firstTokenMs },
+        { name: "first_output", ms: sample.firstTokenMs },
+        ...(typeof sample.firstTextEventMs === "number"
+          ? [{ name: "first_text_event", ms: sample.firstTextEventMs }]
+          : []),
+        ...(typeof sample.secondOutputMs === "number"
+          ? [{ name: "second_output", ms: sample.secondOutputMs }]
+          : []),
+        ...(typeof sample.thirdOutputMs === "number"
+          ? [{ name: "third_output", ms: sample.thirdOutputMs }]
+          : []),
+        ...(typeof sample.fifthOutputMs === "number"
+          ? [{ name: "fifth_output", ms: sample.fifthOutputMs }]
+          : []),
+        ...(typeof sample.firstToolInvocationMs === "number"
+          ? [
+              {
+                name: "first_tool_invocation",
+                ms: sample.firstToolInvocationMs,
+              },
+            ]
+          : []),
         { name: "stream_total", ms: sample.streamTotalMs },
       ],
       tags,
@@ -800,10 +834,13 @@ function createArtifactRecord(opts: {
   target: string;
   totalMs: number;
   provisionMs?: number | null;
+  sessionCreateMs?: number | null;
   firstTokenMs?: number | null;
   taskCompleteMs?: number | null;
   deleteMs?: number | null;
   sseConnectMs?: number | null;
+  eventCount?: number | null;
+  interTokenAvgMs?: number | null;
   timingsRaw: Array<{ name: string; ms: number }>;
   tags: string[];
 }): BenchmarkRecord {
@@ -833,7 +870,8 @@ function createArtifactRecord(opts: {
     total_ms: Math.round(opts.totalMs),
     provision_ms:
       opts.provisionMs == null ? null : Math.round(opts.provisionMs),
-    session_create_ms: null,
+    session_create_ms:
+      opts.sessionCreateMs == null ? null : Math.round(opts.sessionCreateMs),
     first_token_ms:
       opts.firstTokenMs == null ? null : Math.round(opts.firstTokenMs),
     task_complete_ms:
@@ -843,7 +881,7 @@ function createArtifactRecord(opts: {
     delete_ms: opts.deleteMs == null ? null : Math.round(opts.deleteMs),
     sse_connect_ms:
       opts.sseConnectMs == null ? null : Math.round(opts.sseConnectMs),
-    event_count: null,
+    event_count: opts.eventCount == null ? null : Math.round(opts.eventCount),
     token_count: null,
     file_count: null,
     sandbox_count: null,
@@ -870,7 +908,8 @@ function createArtifactRecord(opts: {
     install_cold_ms: null,
     install_warm_ms: null,
     tokens_per_second: null,
-    inter_token_avg_ms: null,
+    inter_token_avg_ms:
+      opts.interTokenAvgMs == null ? null : Math.round(opts.interTokenAvgMs),
     stream_interruptions: null,
     error_code: null,
     error_phase: null,
diff --git a/products/sandbox/evals/src/runner/index.ts b/products/sandbox/evals/src/runner/index.ts
index 69860eb..e820d03 100644
--- a/products/sandbox/evals/src/runner/index.ts
+++ b/products/sandbox/evals/src/runner/index.ts
@@ -1,6 +1,9 @@
 import { randomUUID } from "node:crypto";
 import { checkHealth, cleanup } from "../environment/index.js";
-import { resetSdkClient } from "../helpers.js";
+import {
+  resetSdkClient,
+  transformCreateBodyForOrchestrator,
+} from "../helpers.js";
 import { estimateCost } from "../pricing.js";
 import type {
   ApiClient,
@@ -59,8 +62,12 @@ function percentile(values: number[], p: number): number {
 // Path routing for direct orchestrator mode
 // ---------------------------------------------------------------------------
 
-/** Track sidecar URLs per sandbox ID for direct-to-sidecar routing */
-const sidecarUrls = new Map<string, string>();
+/** Track sidecar connection info per sandbox ID for direct-to-sidecar routing */
+const sidecarConnections = new Map<
+  string,
+  { baseUrl: string; authToken?: string }
+>();
+const sessionToSandbox = new Map<string, string>();
 
 interface RouteTarget {
   baseUrl: string;
@@ -104,6 +111,15 @@ function routePath(
     }
     const msgMatch = rest.match(/^([^/]+)\/message$/);
     if (msgMatch) {
+      const sandboxId = sessionToSandbox.get(msgMatch[1]);
+      const sidecar = sandboxId ? sidecarConnections.get(sandboxId) : undefined;
+      if (sidecar) {
+        return {
+          baseUrl: sidecar.baseUrl,
+          path: "/agents/run/stream",
+          useOrchestratorAuth: false,
+        };
+      }
       return {
         baseUrl: orchestratorUrl,
         path: "/agents/run/stream",
@@ -112,6 +128,15 @@ function routePath(
     }
     const eventsMatch = rest.match(/^([^/]+)\/events$/);
     if (eventsMatch) {
+      const sandboxId = sessionToSandbox.get(eventsMatch[1]);
+      const sidecar = sandboxId ? sidecarConnections.get(sandboxId) : undefined;
+      if (sidecar) {
+        return {
+          baseUrl: sidecar.baseUrl,
+          path: `/agents/events?sessionId=${eventsMatch[1]}`,
+          useOrchestratorAuth: false,
+        };
+      }
       return {
         baseUrl: orchestratorUrl,
         path: `/agents/events?sessionId=${eventsMatch[1]}`,
@@ -132,10 +157,10 @@ function routePath(
     const runtimeMatch = rest.match(/^\/([^/]+)\/runtime\/(.*)$/);
     if (runtimeMatch) {
       const sandboxId = runtimeMatch[1];
-      const sidecarUrl = sidecarUrls.get(sandboxId);
-      if (sidecarUrl) {
+      const sidecar = sidecarConnections.get(sandboxId);
+      if (sidecar) {
         return {
-          baseUrl: sidecarUrl,
+          baseUrl: sidecar.baseUrl,
           path: `/${runtimeMatch[2]}`,
           useOrchestratorAuth: false,
         };
@@ -158,23 +183,52 @@ function routePath(
 }
 
 /**
- * Look up sidecar URL for a sandbox from the orchestrator's sidecars list.
+ * Look up sidecar connection info for a sandbox from the orchestrator.
  */
 async function resolveSidecarUrl(
   orchestratorUrl: string,
   sandboxId: string,
   headers: Record<string, string>,
-): Promise<string | undefined> {
+): Promise<{ baseUrl: string; authToken?: string } | undefined> {
   try {
+    const detailsRes = await fetch(`${orchestratorUrl}/projects/${sandboxId}`, {
+      headers,
+    });
+    if (detailsRes.ok) {
+      const detailsData = (await detailsRes.json()) as {
+        project?: {
+          connection?: { runtimeUrl?: string; authToken?: string };
+          runtimeUrl?: string;
+        };
+      };
+      const project = detailsData.project ?? (detailsData as never);
+      const runtimeUrl =
+        project?.connection?.runtimeUrl ??
+        (project?.runtimeUrl as string | undefined);
+      if (runtimeUrl) {
+        return {
+          baseUrl: runtimeUrl,
+          authToken: project?.connection?.authToken,
+        };
+      }
+    }
+
     const res = await fetch(`${orchestratorUrl}/sidecars`, { headers });
     if (!res.ok) return undefined;
     const data = (await res.json()) as {
-      sidecars: Array<{ id: string; sessionId: string; baseUrl: string }>;
+      sidecars: Array<{
+        id: string;
+        sessionId: string;
+        baseUrl: string;
+        authToken?: string;
+      }>;
     };
     const sidecar = data.sidecars?.find(
       (s) => s.sessionId === sandboxId || s.id === sandboxId,
     );
-    return sidecar?.baseUrl;
+    return sidecar
+      ? { baseUrl: sidecar.baseUrl, authToken: sidecar.authToken }
+      : undefined;
   } catch {
     return undefined;
   }
@@ -193,48 +247,111 @@ function createApiClient(env: EnvironmentConfig): ApiClient {
 
       // For sidecar routes in direct mode, resolve the sidecar URL if not yet known
       if (direct && !target.useOrchestratorAuth) {
+        const isSessionCreate =
+          path === "/v1/sessions" && init?.method === "POST";
+        let body = init?.body;
+        if (path === "/v1/sandboxes" && init?.method === "POST") {
+          body = transformCreateBodyForOrchestrator(body);
+        }
+        let sessionSandboxId: string | undefined;
+        let routedSessionId: string | undefined;
+        if (isSessionCreate && body && typeof body === "string") {
+          const parsed = JSON.parse(body) as { projectId?: unknown };
+          sessionSandboxId =
+            typeof parsed.projectId === "string" ? parsed.projectId : undefined;
+          if (sessionSandboxId && !sidecarConnections.has(sessionSandboxId)) {
+            const connection = await resolveSidecarUrl(
+              env.apiUrl,
+              sessionSandboxId,
+              orchHeaders,
+            );
+            if (connection) {
+              sidecarConnections.set(sessionSandboxId, connection);
+            }
+          }
+        }
         // Try to find sandbox ID from path and resolve sidecar URL
         const sandboxMatch = path.match(/\/v1\/sandboxes\/([^/]+)/);
-        if (sandboxMatch && !sidecarUrls.has(sandboxMatch[1])) {
-          const url = await resolveSidecarUrl(
+        if (sandboxMatch && !sidecarConnections.has(sandboxMatch[1])) {
+          const connection = await resolveSidecarUrl(
             env.apiUrl,
             sandboxMatch[1],
             orchHeaders,
           );
-          if (url) sidecarUrls.set(sandboxMatch[1], url);
+          if (connection) sidecarConnections.set(sandboxMatch[1], connection);
+        }
+        const sessionMessageMatch = path.match(
+          /^\/v1\/sessions\/([^/]+)\/message$/,
+        );
+        const sessionEventsMatch = path.match(
+          /^\/v1\/sessions\/([^/]+)\/events$/,
+        );
+        routedSessionId = sessionMessageMatch?.[1] ?? sessionEventsMatch?.[1];
+        const routedSandboxId =
+          sandboxMatch?.[1] ??
+          (routedSessionId ? sessionToSandbox.get(routedSessionId) : undefined);
+        if (routedSandboxId && !sidecarConnections.has(routedSandboxId)) {
+          const connection = await resolveSidecarUrl(
+            env.apiUrl,
+            routedSandboxId,
+            orchHeaders,
+          );
+          if (connection) {
+            sidecarConnections.set(routedSandboxId, connection);
+          }
         }
 
         // Re-route with resolved sidecar URL
         const updated = routePath(path, env.apiUrl, direct);
-        const sidecarUrl = !updated.useOrchestratorAuth
-          ? [...sidecarUrls.values()][0] // Use first known sidecar for session ops
-          : undefined;
-        const baseUrl = sidecarUrl ?? updated.baseUrl;
+        let sidecar: { baseUrl: string; authToken?: string } | undefined;
+        const sandboxId = routedSandboxId;
+        if (!updated.useOrchestratorAuth && sandboxId) {
+          sidecar = sidecarConnections.get(sandboxId);
+        } else if (!updated.useOrchestratorAuth && sessionSandboxId) {
+          sidecar = sidecarConnections.get(sessionSandboxId);
+        }
+        const baseUrl = sidecar?.baseUrl ?? updated.baseUrl;
         const headers: Record<string, string> = {
+          ...(sidecar ? {} : orchHeaders),
           "Content-Type": "application/json",
           ...(init?.headers as Record<string, string>),
         };
-
-        // Add sidecar auth token if configured (required for FC VMs, optional for Docker)
-        const sidecarToken = process.env.SIDECAR_AUTH_TOKEN;
+        const sidecarToken = sidecar?.authToken ?? process.env.SIDECAR_AUTH_TOKEN;
         if (sidecarToken && !headers.Authorization) {
           headers.Authorization = `Bearer ${sidecarToken}`;
         }
-
-        // For message sends, inject sessionId into body (sidecar expects it)
-        let body = init?.body;
-        const msgMatch = path.match(/\/v1\/sessions\/([^/]+)\/message$/);
+        const msgMatch = sessionMessageMatch;
         if (msgMatch && body && typeof body === "string") {
           const parsed = JSON.parse(body);
           parsed.sessionId = msgMatch[1];
           body = JSON.stringify(parsed);
         }
 
-        return fetch(`${baseUrl}${updated.path}`, {
+        const response = await fetch(`${baseUrl}${updated.path}`, {
           ...init,
           body,
           headers,
         });
+        if (isSessionCreate && response.ok && sessionSandboxId) {
+          try {
+            const payload = (await response.clone().json()) as {
+              id?: unknown;
+              sessionId?: unknown;
+            };
+            const sessionId =
+              typeof payload.id === "string"
+                ? payload.id
+                : typeof payload.sessionId === "string"
+                  ? payload.sessionId
+                  : undefined;
+            if (sessionId) {
+              sessionToSandbox.set(sessionId, sessionSandboxId);
+            }
+          } catch {
+            // Ignore non-JSON direct session responses and return the original response.
+          }
+        }
+        return response;
       }
 
       return fetch(`${target.baseUrl}${target.path}`, {
@@ -254,7 +371,11 @@ function createApiClient(env: EnvironmentConfig): ApiClient {
       return res.json() as Promise;
     },
     setSidecarUrl: (sandboxId: string, url: string) => {
-      sidecarUrls.set(sandboxId, url);
+      const existing = sidecarConnections.get(sandboxId);
+      sidecarConnections.set(sandboxId, {
+        baseUrl: url,
+        authToken: existing?.authToken,
+      });
     },
   };
 }
@@ -274,7 +395,8 @@ async function runScenario(
   runId: string,
 ): Promise {
   // Clear stale state from previous scenarios
-  sidecarUrls.clear();
+  sidecarConnections.clear();
+  sessionToSandbox.clear();
   resetSdkClient();
 
   const tracked: Array<{ type: "sandbox" | "session"; id: string }> = [];
diff --git a/products/sandbox/sdk/src/client.ts b/products/sandbox/sdk/src/client.ts
index 02900cb..f5df8b8 100644
--- a/products/sandbox/sdk/src/client.ts
+++ b/products/sandbox/sdk/src/client.ts
@@ -73,6 +73,23 @@ function isLocalSandboxEndpoint(baseUrl: string): boolean {
   }
 }
 
+function withRequestContextHeaders(
+  response: Response,
+  path: string,
+  method?: string,
+): Response {
+  const headers = new Headers(response.headers);
+  headers.set("x-tangle-request-path", path);
+  if (method) {
+    headers.set("x-tangle-request-method", method);
+  }
+  return new Response(response.body, {
+    status: response.status,
+    statusText: response.statusText,
+    headers,
+  });
+}
+
 async function resolveLocalCliAuthFiles(
   backendType: "codex" | "claude-code",
 ): Promise<BackendAuthFile[] | undefined> {
@@ -327,7 +344,12 @@ export class SandboxClient implements HttpClient {
 
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
 
     const data = await response.json();
@@ -376,7 +398,12 @@ export class SandboxClient implements HttpClient {
 
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
 
     const data = await response.json();
@@ -413,7 +440,12 @@ export class SandboxClient implements HttpClient {
 
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
 
     const data = await response.json();
@@ -441,7 +473,12 @@ export class SandboxClient implements HttpClient {
 
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
 
     const data = await response.json();
@@ -471,7 +508,12 @@ export class SandboxClient implements HttpClient {
 
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
 
     const data = await response.json();
@@ -543,7 +585,12 @@ export class SandboxClient implements HttpClient {
 
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
 
     return response.json();
@@ -775,12 +822,18 @@ export class SandboxClient implements HttpClient {
       throw new NetworkError(
         `Failed to connect to Sandbox API: ${err instanceof Error ? err.message : String(err)}`,
         err instanceof Error ? err : undefined,
+        { endpoint: "/batch/run", origin: "sandbox-api" },
       );
     }
 
     if (!response.ok) {
       const errBody = await response.text();
-      throw parseErrorResponse(response.status, errBody);
+      throw parseErrorResponse(
+        response.status,
+        errBody,
+        undefined,
+        response.headers,
+      );
     }
 
     // The shared parser yields { type, data, id? } and throws
@@ -819,7 +872,7 @@ export class SandboxClient implements HttpClient {
         signal: options?.signal ?? controller.signal,
       });
 
-      return response;
+      return withRequestContextHeaders(response, path, options?.method);
     } catch (err) {
       if (err instanceof Error && err.name === "AbortError") {
         throw new TimeoutError(this.timeoutMs);
@@ -828,6 +881,7 @@ export class SandboxClient implements HttpClient {
       throw new NetworkError(
         `Failed to connect to Sandbox API: ${err instanceof Error ? err.message : String(err)}`,
         err instanceof Error ? err : undefined,
+        { endpoint: path, origin: "sandbox-api" },
       );
     } finally {
       clearTimeout(timeoutId);
@@ -881,17 +935,32 @@ class SecretsManagerImpl implements SecretsManager {
 
     if (response.status === 409) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
 
     if (response.status === 400) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
 
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
 
     const data = await response.json();
@@ -907,7 +976,12 @@ class SecretsManagerImpl implements SecretsManager {
 
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
 
     const data = await response.json();
@@ -931,7 +1005,12 @@ class SecretsManagerImpl implements SecretsManager {
 
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
 
     const data = await response.json();
@@ -953,12 +1032,22 @@ class SecretsManagerImpl implements SecretsManager {
 
     if (response.status === 400) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
 
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
 
     const data = await response.json();
@@ -983,7 +1072,12 @@ class SecretsManagerImpl implements SecretsManager {
 
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
   }
 }
@@ -1011,7 +1105,12 @@ class EnvironmentsClient {
     const response = await this.client.fetch("/v1/environments");
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
     const data = (await response.json()) as {
       environments?: SandboxEnvironment[];
@@ -1094,7 +1193,12 @@ class TeamsClient {
     const response = await this.client.fetch("/v1/teams");
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
     const data = (await response.json()) as { teams?: Team[] };
     return data.teams ?? [];
@@ -1107,7 +1211,12 @@ class TeamsClient {
     });
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
     const data = (await response.json()) as { team: Team };
     return data.team;
@@ -1119,7 +1228,12 @@ class TeamsClient {
     );
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
     const data = (await response.json()) as { team: Team };
     return data.team;
@@ -1132,7 +1246,12 @@ class TeamsClient {
     );
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
     const data = (await response.json()) as { team: Team };
     return data.team;
@@ -1145,7 +1264,12 @@ class TeamsClient {
     );
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
   }
 
@@ -1156,7 +1280,12 @@ class TeamsClient {
     );
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
   }
 
@@ -1166,7 +1295,12 @@ class TeamsClient {
     );
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
     const data = (await response.json()) as { members?: TeamMember[] };
     return data.members ?? [];
@@ -1186,7 +1320,12 @@ class TeamsClient {
     );
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
     const data = (await response.json()) as { member: TeamMember };
     return data.member;
@@ -1199,7 +1338,12 @@ class TeamsClient {
     );
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
   }
 
@@ -1209,7 +1353,12 @@ class TeamsClient {
     );
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
     const data = (await response.json()) as { invitations?: TeamInvitation[] };
     return data.invitations ?? [];
@@ -1225,7 +1374,12 @@ class TeamsClient {
     );
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
     const data = (await response.json()) as { invitation: TeamInvitation };
     return data.invitation;
@@ -1238,7 +1392,12 @@ class TeamsClient {
     );
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
   }
 
@@ -1249,7 +1408,12 @@ class TeamsClient {
     );
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
     const data = (await response.json()) as { member: TeamMember };
     return data.member;
diff --git a/products/sandbox/sdk/src/collaboration/client.ts b/products/sandbox/sdk/src/collaboration/client.ts
index 0297699..de45e86 100644
--- a/products/sandbox/sdk/src/collaboration/client.ts
+++ b/products/sandbox/sdk/src/collaboration/client.ts
@@ -82,10 +82,15 @@ export class CollaborationClient {
 
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body, {
-        method: init.method,
-        path,
-      });
+      throw parseErrorResponse(
+        response.status,
+        body,
+        {
+          method: init.method,
+          path,
+        },
+        response.headers,
+      );
     }
 
     return response.json() as Promise<T>;
diff --git a/products/sandbox/sdk/src/errors.ts b/products/sandbox/sdk/src/errors.ts
index d9a48aa..2d87f9a 100644
--- a/products/sandbox/sdk/src/errors.ts
+++ b/products/sandbox/sdk/src/errors.ts
@@ -12,21 +12,59 @@ export class SandboxError extends Error {
   public readonly status?: number;
   /** Error code for programmatic handling */
   public readonly code: string;
+  /** Best-effort origin of the failing request */
+  public readonly origin?: SandboxErrorOrigin;
+  /** Request path or endpoint when known */
+  public readonly endpoint?: string;
+  /** Retry-after duration in ms when surfaced by the upstream */
+  public readonly retryAfterMs?: number;
+  /** Sidecar version when the runtime emitted it */
+  public readonly sidecarVersion?: string;
+  /** Sidecar image/tag/sha when the runtime emitted it */
+  public readonly containerImage?: string;
 
-  constructor(message: string, code: string, status?: number) {
+  constructor(
+    message: string,
+    code: string,
+    status?: number,
+    metadata?: SandboxErrorMetadata,
+  ) {
     super(message);
     this.name = "SandboxError";
     this.code = code;
     this.status = status;
+    this.origin = metadata?.origin;
+    this.endpoint = metadata?.endpoint;
+    this.retryAfterMs = metadata?.retryAfterMs;
+    this.sidecarVersion = metadata?.sidecarVersion;
+    this.containerImage = metadata?.containerImage;
   }
 }
 
+export type SandboxErrorOrigin =
+  | "sidecar"
+  | "sandbox-api"
+  | "orchestrator"
+  | "runtime"
+  | "unknown";
+
+export interface SandboxErrorMetadata {
+  origin?: SandboxErrorOrigin;
+  endpoint?: string;
+  retryAfterMs?: number;
+  sidecarVersion?: string;
+  containerImage?: string;
+}
+
 /**
  * Authentication failed or API key is invalid.
  */
 export class AuthError extends SandboxError {
-  constructor(message = "Authentication failed") {
-    super(message, "AUTH_ERROR", 401);
+  constructor(
+    message = "Authentication failed",
+    metadata?: SandboxErrorMetadata,
+  ) {
+    super(message, "AUTH_ERROR", 401, metadata);
     this.name = "AuthError";
   }
 }
@@ -40,8 +78,17 @@ export class NotFoundError extends SandboxError {
   /** The resource ID that was not found */
   public readonly resourceId: string;
 
- constructor(resourceType: string, resourceId: string) {
- super(
${resourceType} not found: ${resourceId}, "NOT_FOUND", 404);
-
constructor(
-
resourceType: string,
-
resourceId: string,
-
metadata?: SandboxErrorMetadata,
-
) {
-
super(
-
`${resourceType} not found: ${resourceId}`, -
"NOT_FOUND", -
404, -
metadata, -
); this.name = "NotFoundError"; this.resourceType = resourceType; this.resourceId = resourceId; @@ -58,18 +105,27 @@ export class QuotaError extends SandboxError { public readonly current?: number; /** Maximum allowed */ public readonly limit?: number;
-
/** Suggested retry-after duration in ms */
-
public readonly retryAfterMs?: number;
constructor( quotaType: string, message?: string, current?: number, limit?: number,
-
metadata?: SandboxErrorMetadata, ) {
- super(message ??
Quota exceeded: ${quotaType}, "QUOTA_EXCEEDED", 429);
- super(
-
message ?? `Quota exceeded: ${quotaType}`, -
"QUOTA_EXCEEDED", -
429, -
metadata, - ); this.name = "QuotaError"; this.quotaType = quotaType; this.current = current; this.limit = limit;
- this.retryAfterMs = metadata?.retryAfterMs; } }
@@ -80,8 +136,12 @@ export class ValidationError extends SandboxError { /** Field-level validation errors */ public readonly fields?: Record<string, string>;
- constructor(message: string, fields?: Record<string, string>) {
- super(message, "VALIDATION_ERROR", 400);
- constructor(
- message: string,
- fields?: Record<string, string>,
- metadata?: SandboxErrorMetadata,
- ) {
- super(message, "VALIDATION_ERROR", 400, metadata); this.name = "ValidationError"; this.fields = fields; } @@ -96,8 +156,13 @@ export class StateError extends SandboxError { /** Required state for the operation */ public readonly requiredState?: string;
- constructor(message: string, currentState: string, requiredState?: string) {
- super(message, "INVALID_STATE", 409);
- constructor(
- message: string,
- currentState: string,
- requiredState?: string,
- metadata?: SandboxErrorMetadata,
- ) {
- super(message, "INVALID_STATE", 409, metadata); this.name = "StateError"; this.currentState = currentState; this.requiredState = requiredState; @@ -111,8 +176,17 @@ export class TimeoutError extends SandboxError { /** Timeout duration in milliseconds */ public readonly timeoutMs: number;

-  constructor(timeoutMs: number, message?: string) {
-    super(message ?? `Request timed out after ${timeoutMs}ms`, "TIMEOUT", 408);
+  constructor(
+    timeoutMs: number,
+    message?: string,
+    metadata?: SandboxErrorMetadata,
+  ) {
+    super(
+      message ?? `Request timed out after ${timeoutMs}ms`,
+      "TIMEOUT",
+      408,
+      metadata,
+    );
     this.name = "TimeoutError";
     this.timeoutMs = timeoutMs;
   }
@@ -125,8 +199,18 @@ export class NetworkError extends SandboxError {
   /** The underlying error */
   public readonly cause?: Error;

-  constructor(message: string, cause?: Error) {
-    super(message, "NETWORK_ERROR");
+  constructor(
+    message: string,
+    causeOrMetadata?: Error | SandboxErrorMetadata,
+    metadata?: SandboxErrorMetadata,
+  ) {
+    const cause =
+      causeOrMetadata instanceof Error ? causeOrMetadata : undefined;
+    const resolvedMetadata =
+      causeOrMetadata instanceof Error
+        ? metadata
+        : (causeOrMetadata ?? metadata);
+    super(message, "NETWORK_ERROR", undefined, resolvedMetadata);
     this.name = "NetworkError";
     this.cause = cause;
   }
@@ -136,12 +220,69 @@ export class NetworkError extends SandboxError {
  * The server returned an unexpected error.
  */
 export class ServerError extends SandboxError {
-  constructor(message: string, status = 500) {
-    super(message, "SERVER_ERROR", status);
+  constructor(message: string, status = 500, metadata?: SandboxErrorMetadata) {
+    super(message, "SERVER_ERROR", status, metadata);
     this.name = "ServerError";
   }
 }

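The `NetworkError` constructor in this hunk now accepts either a cause or a metadata object in its second position. The sketch below isolates that argument resolution so the three call shapes can be checked on their own. `Metadata` is a simplified, hypothetical stand-in for `SandboxErrorMetadata` (whose definition is not shown in this part of the diff), and `resolveNetworkErrorArgs` is an illustrative helper, not a function from the SDK.

```typescript
// Hypothetical, simplified stand-in for SandboxErrorMetadata.
type Metadata = { endpoint?: string; origin?: string };

// Mirrors the argument resolution inside the NetworkError constructor in the diff.
function resolveNetworkErrorArgs(
  causeOrMetadata?: Error | Metadata,
  metadata?: Metadata,
): { cause: Error | undefined; metadata: Metadata | undefined } {
  const cause = causeOrMetadata instanceof Error ? causeOrMetadata : undefined;
  const resolved =
    causeOrMetadata instanceof Error ? metadata : (causeOrMetadata ?? metadata);
  return { cause, metadata: resolved };
}

const boom = new Error("socket hang up");

// (cause, metadata): both are kept.
const withCause = resolveNetworkErrorArgs(boom, { endpoint: "/exec" });

// (metadata) only: the second positional argument is treated as metadata.
const metadataOnly = resolveNetworkErrorArgs({ endpoint: "/exec" });

// (undefined, metadata): the shape the sandbox.ts call site produces when the
// caught value is not an Error; the third argument is still picked up.
const undefinedCause = resolveNetworkErrorArgs(undefined, { origin: "runtime" });
```

The third case matters because the `sandbox.ts` call site passes `err instanceof Error ? err : undefined` followed by a metadata literal, so the metadata must survive an `undefined` second argument.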
+function parseRetryAfterMs(
+  rawValue: string | undefined,
+  data: Record<string, unknown>,
+): number | undefined {
+  if (
+    typeof data.retryAfterMs === "number" &&
+    Number.isFinite(data.retryAfterMs)
+  ) {
+    return Math.max(0, data.retryAfterMs);
+  }
+  if (!rawValue) return undefined;
+  const trimmed = rawValue.trim();
+  if (!trimmed) return undefined;
+  const seconds = Number(trimmed);
+  if (Number.isFinite(seconds)) {
+    return Math.max(0, seconds) * 1000;
+  }
+  const targetTs = Date.parse(trimmed);
+  if (!Number.isNaN(targetTs)) {
+    return Math.max(0, targetTs - Date.now());
+  }
+  return undefined;
+}
+
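The precedence in `parseRetryAfterMs` (body field first, then delta-seconds, then HTTP-date) is easier to review with concrete inputs. This is a minimal sketch of the same logic, not the SDK's code: the `nowMs` parameter is a hypothetical addition that injects the clock so the HTTP-date case is deterministic.

```typescript
// Sketch of the Retry-After handling from the diff.
// `nowMs` is a hypothetical parameter added here for determinism only.
function parseRetryAfterMs(
  rawValue: string | undefined,
  data: Record<string, unknown>,
  nowMs: number = Date.now(),
): number | undefined {
  // 1. A finite numeric `retryAfterMs` in the JSON body wins over the header.
  const bodyValue = data["retryAfterMs"];
  if (typeof bodyValue === "number" && Number.isFinite(bodyValue)) {
    return Math.max(0, bodyValue);
  }
  const trimmed = rawValue?.trim();
  if (!trimmed) return undefined;
  // 2. Delta-seconds form, e.g. "Retry-After: 120".
  const seconds = Number(trimmed);
  if (Number.isFinite(seconds)) return Math.max(0, seconds) * 1000;
  // 3. HTTP-date form, e.g. "Retry-After: Wed, 21 Oct 2015 07:28:00 GMT".
  const targetTs = Date.parse(trimmed);
  if (!Number.isNaN(targetTs)) return Math.max(0, targetTs - nowMs);
  return undefined;
}

const now = Date.parse("Wed, 21 Oct 2015 07:27:00 GMT");

const fromBody = parseRetryAfterMs("120", { retryAfterMs: 250 }, now); // 250
const fromSeconds = parseRetryAfterMs(" 120 ", {}, now); // 120000
const fromDate = parseRetryAfterMs("Wed, 21 Oct 2015 07:28:00 GMT", {}, now); // 60000
const pastDate = parseRetryAfterMs("Wed, 21 Oct 2015 07:26:00 GMT", {}, now); // 0
const unparseable = parseRetryAfterMs("soon", {}, now); // undefined
```

Past dates and negative values clamp to `0` rather than returning `undefined`, so a caller cannot distinguish "retry immediately" from "header was stale" using this value alone.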
+function inferOrigin(
+  headers?: Headers,
+  context?: { method?: string; path?: string },
+): SandboxErrorOrigin | undefined {
+  const derivedPath =
+    context?.path ?? headers?.get("x-tangle-request-path") ?? undefined;
+  if (!headers) return context?.path ? "sandbox-api" : undefined;
+  if (
+    headers.has("x-sidecar-version") ||
+    headers.has("x-sidecar-image") ||
+    headers.has("x-sidecar-image-tag")
+  ) {
+    return "sidecar";
+  }
+  if (headers.has("x-orchestrator-version")) {
+    return "orchestrator";
+  }
+  if (derivedPath?.startsWith("/v1/")) {
+    return "sandbox-api";
+  }
+  if (derivedPath) {
+    if (
+      derivedPath.startsWith("/projects") ||
+      derivedPath.startsWith("/sidecars") ||
+      derivedPath.startsWith("/instances")
+    ) {
+      return "orchestrator";
+    }
+    return "runtime";
+  }
+  return "unknown";
+}
+
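The branch order in `inferOrigin` decides which signal wins when headers and path disagree. The sketch below reproduces that order so it can be exercised without a network response. Two pieces are assumptions: the `Origin` union is inferred from the string literals in the diff (the real `SandboxErrorOrigin` definition is not shown), and `HeaderLookup` is a hypothetical stand-in for the two `Headers` methods the function uses.

```typescript
// Assumed from the string literals in the diff; the real SandboxErrorOrigin
// definition is not shown.
type Origin = "sidecar" | "orchestrator" | "sandbox-api" | "runtime" | "unknown";

// Hypothetical minimal stand-in for the fetch `Headers` lookups used here.
type HeaderLookup = {
  has(name: string): boolean;
  get(name: string): string | null | undefined;
};

function inferOrigin(
  headers?: HeaderLookup,
  context?: { method?: string; path?: string },
): Origin | undefined {
  const derivedPath =
    context?.path ?? headers?.get("x-tangle-request-path") ?? undefined;
  // Without headers, any known path is attributed to the sandbox API.
  if (!headers) return context?.path ? "sandbox-api" : undefined;
  // Version/image headers outrank every path-based rule.
  if (
    headers.has("x-sidecar-version") ||
    headers.has("x-sidecar-image") ||
    headers.has("x-sidecar-image-tag")
  ) {
    return "sidecar";
  }
  if (headers.has("x-orchestrator-version")) return "orchestrator";
  if (derivedPath?.startsWith("/v1/")) return "sandbox-api";
  if (derivedPath) {
    const orchestratorPrefixes = ["/projects", "/sidecars", "/instances"];
    return orchestratorPrefixes.some((prefix) => derivedPath.startsWith(prefix))
      ? "orchestrator"
      : "runtime";
  }
  return "unknown";
}

// A Map with lowercase keys satisfies HeaderLookup structurally.
const sidecarWins = inferOrigin(new Map([["x-sidecar-version", "1.2.3"]]), {
  path: "/v1/sandboxes/abc",
}); // "sidecar"
const noHeaders = inferOrigin(undefined, { path: "/exec" }); // "sandbox-api"
const runtimePath = inferOrigin(new Map([["x-tangle-request-path", "/exec"]])); // "runtime"
const orchestratorPath = inferOrigin(new Map(), { path: "/projects/p1" }); // "orchestrator"
const nothingKnown = inferOrigin(new Map()); // "unknown"
```

Note that the no-headers branch labels every path `sandbox-api`, including paths such as `/exec` that the header-aware branch would classify as `runtime`.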
 /**
  * Parse an error response from the API.
  * @param status - HTTP status code
@@ -152,6 +293,7 @@ export function parseErrorResponse(
   status: number,
   body: string,
   context?: { method?: string; path?: string },
+  headers?: Headers,
 ): SandboxError {
   let data: Record<string, unknown>;
   try {
@@ -177,34 +319,66 @@ export function parseErrorResponse(
     (typeof errorObj === "string" ? errorObj : undefined) ||
     body ||
     "Unknown error";
+  const details =
+    typeof data.details === "string" && data.details.trim().length > 0
+      ? data.details.trim()
+      : undefined;
+  const shouldAppendDetails =
+    !!details &&
+    details !== baseMessage &&
+    (baseMessage === "Unknown error" ||
+      /^failed\b/i.test(baseMessage) ||
+      /^provision failed\b/i.test(baseMessage) ||
+      /^deprovision failed\b/i.test(baseMessage));
   const code = (data.code as string | undefined) || nestedCode;

   // Add request context to message for easier debugging
   const prefix = context
     ? `${context.method ?? "REQUEST"} ${context.path ?? ""}: `
     : "";
-  const message = `${prefix}${baseMessage}`;
+  const message = `${prefix}${shouldAppendDetails ? `${baseMessage}: ${details}` : baseMessage}`;
+  const metadata: SandboxErrorMetadata = {
+    origin: inferOrigin(headers, context),
+    endpoint:
+      context?.path ?? headers?.get("x-tangle-request-path") ?? undefined,
+    retryAfterMs: parseRetryAfterMs(
+      headers?.get("retry-after") ?? undefined,
+      data,
+    ),
+    sidecarVersion: headers?.get("x-sidecar-version") ?? undefined,
+    containerImage:
+      headers?.get("x-sidecar-image") ??
+      headers?.get("x-sidecar-image-tag") ??
+      undefined,
+  };

   switch (status) {
     case 400:
       return new ValidationError(
         message,
         data.fields as Record<string, string> | undefined,
+        metadata,
       );
     case 401:
-      return new AuthError(message);
+      return new AuthError(message, metadata);
     case 404:
       return new NotFoundError(
         (data.resourceType as string) || "Resource",
         (data.resourceId as string) || "unknown",
+        metadata,
       );
     case 408:
-      return new TimeoutError((data.timeoutMs as number) || 30000, message);
+      return new TimeoutError(
+        (data.timeoutMs as number) || 30000,
+        message,
+        metadata,
+      );
     case 409:
       return new StateError(
         message,
         (data.currentState as string) || "unknown",
         data.requiredState as string | undefined,
+        metadata,
       );
     case 429:
       return new QuotaError(
@@ -212,11 +386,24 @@ export function parseErrorResponse(
         message,
         data.current as number | undefined,
         data.limit as number | undefined,
+        metadata,
       );
+    case 501:
+      return new SandboxError(
+        message,
+        code || "NOT_IMPLEMENTED",
+        status,
+        metadata,
+      );
     default:
       if (status >= 500) {
-        return new ServerError(message, status);
+        return new ServerError(message, status, metadata);
       }
-      return new SandboxError(message, code || "UNKNOWN_ERROR", status);
+      return new SandboxError(
+        message,
+        code || "UNKNOWN_ERROR",
+        status,
+        metadata,
+      );
   }
 }
diff --git a/products/sandbox/sdk/src/orchestrator.ts b/products/sandbox/sdk/src/orchestrator.ts
index 315c7f6..5af1b6a 100644
--- a/products/sandbox/sdk/src/orchestrator.ts
+++ b/products/sandbox/sdk/src/orchestrator.ts
@@ -224,6 +224,23 @@ function normalizeBaseUrl(url: string): string {
   return url.replace(/\/+$/, "");
 }

+function withRequestContextHeaders(
+  response: Response,
+  path: string,
+  method?: string,
+): Response {
+  const headers = new Headers(response.headers);
+  headers.set("x-tangle-request-path", path);
+  if (method) {
+    headers.set("x-tangle-request-method", method);
+  }
+  return new Response(response.body, {
+    status: response.status,
+    statusText: response.statusText,
+    headers,
+  });
+}
+
 export class OrchestratorClient {
   private readonly baseUrl: string;
   private readonly apiKey: string;
@@ -473,10 +490,15 @@ export class OrchestratorClient {
     const response = await this.fetch(userId, path, init);
     const text = await response.text();
     if (!response.ok) {
-      throw parseErrorResponse(response.status, text, {
-        method: init.method,
-        path,
-      });
+      throw parseErrorResponse(
+        response.status,
+        text,
+        {
+          method: init.method,
+          path,
+        },
+        response.headers,
+      );
     }
     return text ? (JSON.parse(text) as T) : ({} as T);
   }
@@ -491,11 +513,12 @@ export class OrchestratorClient {
     const timeoutId = setTimeout(() => controller.abort(), this.timeoutMs);

     try {
-      return await globalThis.fetch(url, {
+      const response = await globalThis.fetch(url, {
         ...options,
         headers: options?.headers ?? this.buildHeaders(userId),
         signal: options?.signal ?? controller.signal,
       });
+      return withRequestContextHeaders(response, path, options?.method);
     } catch (err) {
       if (err instanceof Error && err.name === "AbortError") {
         throw new TimeoutError(this.timeoutMs);
diff --git a/products/sandbox/sdk/src/sandbox.ts b/products/sandbox/sdk/src/sandbox.ts
index 80a48d9..2b2e53f 100644
--- a/products/sandbox/sdk/src/sandbox.ts
+++ b/products/sandbox/sdk/src/sandbox.ts
@@ -337,7 +337,12 @@ export class SandboxInstance {
     const response = await this.client.fetch(`/v1/sandboxes/${this.id}`);
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
     const data = await response.json();
     this.info = this.parseInfo(data);
@@ -375,7 +380,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     return response.json();
@@ -404,7 +414,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     return response.json();
@@ -452,7 +467,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     const data = await response.json();
@@ -498,7 +518,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     const data = await response.json();
@@ -528,7 +553,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
   }

@@ -650,7 +680,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     // Track state for reconnection
@@ -762,7 +797,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     for await (const event of this.parseSSEStream(response, options?.signal)) {
@@ -1042,7 +1082,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     const data = await response.json();
@@ -1104,7 +1149,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     const data = await response.json();
@@ -1129,7 +1179,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     const data = await response.json();
@@ -1157,7 +1212,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     const data = await response.json();
@@ -1179,7 +1239,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
   }

@@ -1196,7 +1261,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     const data = await response.json();
@@ -1220,7 +1290,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
   }

@@ -1234,7 +1309,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
   }

@@ -1247,7 +1327,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     const data = await response.json();
@@ -1267,7 +1352,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
   }

@@ -1397,7 +1487,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     // Report completion
@@ -1429,7 +1524,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     const contentLength = response.headers.get("content-length");
@@ -1500,7 +1600,12 @@ export class SandboxInstance {
       if (!response.ok) {
         const body = await response.text();
-        throw parseErrorResponse(response.status, body);
+        throw parseErrorResponse(
+          response.status,
+          body,
+          undefined,
+          response.headers,
+        );
       }
     } finally {
       // Clean up temp file
@@ -1524,7 +1629,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     // Save tar to temp file and extract
@@ -1568,7 +1678,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     const data = await response.json();
@@ -1600,7 +1715,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     const data = await response.json();
@@ -1635,7 +1755,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
   }

@@ -1653,7 +1778,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
   }

@@ -1666,7 +1796,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     const data = await response.json();
@@ -1727,7 +1862,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     const data = await response.json();
@@ -1762,7 +1902,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     const u = await response.json();
@@ -1789,7 +1934,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     const u = await response.json();
@@ -1822,7 +1972,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     const u = await response.json();
@@ -1860,7 +2015,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
   }

@@ -1880,7 +2040,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
   }

@@ -1896,7 +2061,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     const data = await response.json();
@@ -1920,7 +2090,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     const data = await response.json();
@@ -1970,7 +2145,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     return response.json();
@@ -1985,7 +2165,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     return response.json();
@@ -2004,7 +2189,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
   }

@@ -2019,7 +2209,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     const data = await response.json();
@@ -2038,7 +2233,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
   }

@@ -2051,7 +2251,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
   }

@@ -2124,7 +2329,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     const data = await response.json();
@@ -2150,7 +2360,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     const data = await response.json();
@@ -2166,7 +2381,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     const data = await response.json();
@@ -2197,7 +2417,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     const data = await response.json();
@@ -2218,7 +2443,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
     const data = await response.json();
@@ -2254,7 +2484,12 @@ export class SandboxInstance {
         if (!response.ok && response.status !== 404) {
           const body = await response.text();
-          throw parseErrorResponse(response.status, body);
+          throw parseErrorResponse(
+            response.status,
+            body,
+            undefined,
+            response.headers,
+          );
         }
       },
@@ -2265,7 +2500,12 @@ export class SandboxInstance {
         if (!response.ok) {
           const body = await response.text();
-          throw parseErrorResponse(response.status, body);
+          throw parseErrorResponse(
+            response.status,
+            body,
+            undefined,
+            response.headers,
+          );
         }
         yield* self.parseProcessLogStream(response);
@@ -2405,7 +2645,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
   }

@@ -2421,7 +2666,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     const data = await response.json();
@@ -2433,7 +2683,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     const data = await response.json();
@@ -2445,7 +2700,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     return response.json();
@@ -2528,7 +2788,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     const data = await response.json();
@@ -2540,7 +2805,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     const data = await response.json();
@@ -2555,7 +2825,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
   }

@@ -2577,7 +2852,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     return response.json();
@@ -2593,7 +2873,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
   }

@@ -2607,7 +2892,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
   }

@@ -2620,7 +2910,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     const data = await response.json();
@@ -2637,7 +2932,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     const data = await response.json();
@@ -2699,7 +2999,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
     const data = await response.json();
@@ -2725,7 +3030,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     const data = await response.json();
@@ -2793,7 +3103,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
     const data = await response.json();
@@ -2818,7 +3133,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     const data = await response.json();
@@ -2846,7 +3166,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     const data = (await response.json()) as Record<string, unknown>;
@@ -2873,7 +3198,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
     const status =
@@ -2926,7 +3256,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
   }

@@ -2979,7 +3314,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     const data = await response.json();
@@ -3047,7 +3387,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     const data = await response.json();
@@ -3079,7 +3424,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     const data = await response.json();
@@ -3103,7 +3453,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
   }

@@ -3155,7 +3510,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     const data = await response.json();
@@ -3180,7 +3540,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     await this.refresh();
@@ -3199,7 +3564,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }

     await this.refresh();
@@ -3215,7 +3585,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
   }

@@ -3529,6 +3904,7 @@ export class SandboxInstance {
       throw new NetworkError(
         `Failed to connect to sandbox: ${err instanceof Error ? err.message : String(err)}`,
         err instanceof Error ? err : undefined,
+        { endpoint: path, origin: "runtime" },
       );
     }
   }
@@ -3606,7 +3982,12 @@ export class SandboxInstance {
     if (!response.ok) {
       const body = await response.text();
-      throw parseErrorResponse(response.status, body);
+      throw parseErrorResponse(
+        response.status,
+        body,
+        undefined,
+        response.headers,
+      );
     }
   }

@@ -3706,7 +4087,17 @@ class DirectRuntimeHttpClient implements HttpClient {
       init.duplex = "half";
     }

-    return fetch(targetUrl, init);
+    const response = await fetch(targetUrl, init);
+    const responseHeaders = new Headers(response.headers);
+    responseHeaders.set("x-tangle-request-path", path);
+    if (options?.method) {
+      responseHeaders.set("x-tangle-request-method", options.method);
+    }
+    return new Response(response.body, {
+      status: response.status,
+      statusText: response.statusText,
+      headers: responseHeaders,
+    });
   }
 }
diff --git a/products/sandbox/sdk/tests/e2e/snapshot-lifecycle.test.ts b/products/sandbox/sdk/tests/e2e/snapshot-lifecycle.test.ts
index 176fb36..6753a12 100644
--- a/products/sandbox/sdk/tests/e2e/snapshot-lifecycle.test.ts
+++ b/products/sandbox/sdk/tests/e2e/snapshot-lifecycle.test.ts
@@ -30,7 +30,25 @@ async function waitForFileContents(
     await sandbox.refresh();
   }

-  throw lastError instanceof Error ? lastError : new Error(String(lastError));
+  let diagnostics = "";
+  try {
+    const tree = await sandbox.exec(
+      'pwd && printf "\\n---\\n" && env | grep -E "^(AGENT_WORKSPACE_ROOT|WORKSPACE_PATH|STORAGE_PATH|HOME)=" | sort && printf "\\n---\\n" && find /home/agent -maxdepth 3 \\( -type f -o -type d \\) | sort | sed -n "1,200p"',
+      { timeoutMs: 20_000 },
+    );
+    diagnostics = `\nWorkspace diagnostics:\n${tree.stdout}`;
+  } catch (diagnosticError) {
+    diagnostics = `\nWorkspace diagnostics unavailable: ${
+      diagnosticError instanceof Error
+        ? diagnosticError.message
+        : String(diagnosticError)
+    }`;
+  }
+  if (lastError instanceof Error) {
+    throw new Error(`${lastError.message}${diagnostics}`);
+  }
+  throw new Error(`${String(lastError)}${diagnostics}`);
 }

 describe("Snapshot Lifecycle E2E", () => {
diff --git a/products/sandbox/sdk/tests/e2e/task-execution.test.ts b/products/sandbox/sdk/tests/e2e/task-execution.test.ts
index 82f00fe..579a10c 100644
--- a/products/sandbox/sdk/tests/e2e/task-execution.test.ts
+++ b/products/sandbox/sdk/tests/e2e/task-execution.test.ts
@@ -5,6 +5,7 @@
  * If these fail, the product is broken.
  */

+import { execFileSync } from "node:child_process";
 import { afterAll, beforeAll, describe, expect, it } from "vitest";
 import type { SandboxInstance } from "../../src/index.js";
 import { EventCollector } from "../helpers/event-collector.js";
@@ -14,6 +15,96 @@ import {
 } from "../helpers/sandbox-test-context.js";

 describe("Task Execution E2E", () => {
+  async function logRuntimeDiagnostics(
+    sandbox: SandboxInstance,
+  ): Promise<void> {
+    try {
+      const execResult = await sandbox.exec(
+        [
+          "echo PATH=$PATH",
+          "echo HOME=$HOME",
+          "echo AGENT_WORKSPACE_ROOT=$AGENT_WORKSPACE_ROOT",
+          "command -v opencode || true",
+          'op=$(command -v opencode || true); [ -n "$op" ] && { ls -l "$op"; stat -c \'%a %U:%G %n\' "$op"; } || true',
+          "command -v node || true",
+          'nodep=$(command -v node || true); [ -n "$nodep" ] && { ls -l "$nodep"; stat -c \'%a %U:%G %n\' "$nodep"; } || true',
+          "pwd",
+          "ls -ld /home/agent /home/agent/* 2>/dev/null | sed -n '1,40p'",
+        ].join(" && "),
+      );
+      console.log(
+        "Runtime exec diagnostics:",
+        JSON.stringify(execResult, null, 2),
+      );
+    } catch (error) {
+      console.log("Runtime exec diagnostics failed:", String(error));
+    }
+
+    const connection = sandbox.connection;
+    if (!connection?.runtimeUrl || !connection.authToken) {
+      console.log("Runtime diagnostics unavailable: missing connection info");
+      return;
+    }
+
+    const headers = {
+      Authorization: `Bearer ${connection.authToken}`,
+    };
+
+    try {
+      const runtimeHost = new URL(connection.runtimeUrl).hostname;
+      const containerIds = execFileSync(
+        "docker",
+        ["ps", "--format", "{{.ID}}"],
+        { encoding: "utf-8" },
+      )
+        .trim()
+        .split("\n")
+        .filter(Boolean);
+      let matchedContainerId: string | undefined;
+      for (const containerId of containerIds) {
+        const ipAddress = execFileSync(
+          "docker",
+          [
+            "inspect",
+            "-f",
+            "{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}",
+            containerId,
+          ],
+          { encoding: "utf-8" },
+        ).trim();
+        if (ipAddress === runtimeHost) {
+          matchedContainerId = containerId;
+          break;
+        }
+      }
+      console.log("Runtime docker mapping:", matchedContainerId ?? "not found");
+      if (matchedContainerId) {
+        const dockerLogs = execFileSync(
+          "docker",
+          ["logs", matchedContainerId],
+          {
+            encoding: "utf-8",
+          },
+        );
+        console.log("Runtime docker logs:", dockerLogs);
+      }
+    } catch (error) {
+      console.log("Runtime docker diagnostics failed:", String(error));
+    }
+
+    for (const path of ["/debug", "/debug/logs?limit=100"]) {
+      try {
+        const response = await fetch(`${connection.runtimeUrl}${path}`, {
+          headers,
+        });
+        const body = await response.text();
+        console.log(`Runtime diagnostics ${path}:`, response.status, body);
+      } catch (error) {
+        console.log(`Runtime diagnostics ${path} failed:`, String(error));
+      }
+    }
+  }
+
let ctx: SandboxTestContext;
   beforeAll(async () => {
@@ -172,6 +263,9 @@ describe("Task Execution E2E", () => {
     );

     console.log("Prompt result:", JSON.stringify(result, null, 2));
+    if (!result.success) {
+      await logRuntimeDiagnostics(sandbox);
+    }
     expect(result.success).toBe(true);
     expect(result.response).toBeDefined();
diff --git a/products/sandbox/sdk/tests/helpers/orchestrator-server.ts b/products/sandbox/sdk/tests/helpers/orchestrator-server.ts
index 1476598..b74b61b 100644
--- a/products/sandbox/sdk/tests/helpers/orchestrator-server.ts
+++ b/products/sandbox/sdk/tests/helpers/orchestrator-server.ts
@@ -10,6 +10,8 @@

 import { type ChildProcess, execFile, spawn } from "node:child_process";
 import { randomBytes } from "node:crypto";
+import { writeFileSync } from "node:fs";
+import { tmpdir } from "node:os";
 import { join } from "node:path";
 import { config as dotenvConfig } from "dotenv";
 import {
@@ -176,7 +178,6 @@ export async function startOrchestratorServer(
   const explicitEnvApiKey = process.env.OPENCODE_MODEL_API_KEY;
   const explicitEnvModel =
     process.env.OPENCODE_MODEL_ID ?? process.env.OPENCODE_MODEL_NAME;
-  const explicitEnvBaseUrl = process.env.OPENCODE_MODEL_BASE_URL;
   const hasExplicitOpencodeEnv =
     typeof explicitEnvProvider === "string" &&
     explicitEnvProvider.length > 0 &&
@@ -216,12 +217,17 @@ export async function startOrchestratorServer(
   // Final defaults - zai-coding-plan has a free tier
   llmProvider = llmProvider ?? "zai-coding-plan";
   llmModel = llmModel ?? "glm-4.7";
-  const llmBaseUrl =
-    config.llmBaseUrl ??
-    (hasExplicitOpencodeEnv ? explicitEnvBaseUrl : undefined);
+  // Keep E2E model routing hermetic. Ambient OPENCODE_MODEL_BASE_URL from a
+  // developer shell can silently force "direct provider" tests through a stale
+  // local proxy and produce hangs/false negatives. Only use an explicit test
+  // config override or the LiteLLM instance we start here.
+  const llmBaseUrl = config.llmBaseUrl;

   // Sidecar image configuration
-  const sidecarImage = process.env.SIDECAR_IMAGE || "agent-sidecar:test";
+  const sidecarImage =
+    process.env.TEST_SIDECAR_IMAGE ||
+    process.env.SIDECAR_IMAGE ||
+    "tangle-sidecar:local";
   const storageAgent = await startTestStorageAgent();

   // Environment for orchestrator process
@@ -280,10 +286,23 @@ export async function startOrchestratorServer(
     DOCKER_SIDECAR_USE_LOCALHOST: "true",
     // Sidecar configuration
     SIDECAR_IMAGE: sidecarImage,
+    DEFAULT_CONTAINER_IMAGE:
+      config.driver === "local"
+        ? parentEnv.DEFAULT_CONTAINER_IMAGE || "node:24-alpine"
+        : parentEnv.DEFAULT_CONTAINER_IMAGE || sidecarImage,
     SIDECAR_ENABLED: "true",
+    SIDECAR_DEBUG_ENABLED: "true",
     RESTIC_PASSWORD: process.env.RESTIC_PASSWORD || "test-restic-password",
   };

+  // Prevent the spawned orchestrator from auto-loading apps/orchestrator/.env
+  // via `dotenv/config`, which can silently reintroduce stale local LiteLLM
+  // defaults (for example DEFAULT_OPENCODE_BASE_URL) into "direct provider"
+  // E2E runs.
+  const hermeticEnvFile = join(tmpdir(), `orchestrator-sdk-e2e-${port}.env`);
+  writeFileSync(hermeticEnvFile, "", "utf8");
+  env.DOTENV_CONFIG_PATH = hermeticEnvFile;
+
   if (storageAgent) {
     env.TEST_STORAGE_AGENT_URL = storageAgent.url;
     env.LOCAL_DRIVER_HOST_AGENT_URL = storageAgent.url;
diff --git a/products/sandbox/sdk/tests/helpers/test-storage-agent.ts b/products/sandbox/sdk/tests/helpers/test-storage-agent.ts
index 8392624..0170aa5 100644
--- a/products/sandbox/sdk/tests/helpers/test-storage-agent.ts
+++ b/products/sandbox/sdk/tests/helpers/test-storage-agent.ts
@@ -362,8 +362,16 @@ export async function startTestStorageAgent(): Promise
       return;
     }

-    if (snapshotListMatch && method === "GET") {
-      const projectRef = decodeURIComponent(snapshotListMatch[1] ?? "");
+    if (
+      (snapshotListMatch && method === "GET") ||
+      (url.pathname.match(/^\/snapshots\/([^/]+)\/list$/) &&
+        method === "POST")
+    ) {
+      const projectRef = decodeURIComponent(
+        snapshotListMatch?.[1] ??
+          url.pathname.match(/^\/snapshots\/([^/]+)\/list$/)?.[1] ??
+          "",
+      );
       json(res, 200, {
         snapshots: (snapshots.get(projectRef) ?? []).map((record) => ({
           id: record.id,
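Reviewer aid, not part of the diff: the hunk above evaluates the same `/snapshots/:projectRef/list` pattern twice, once in the condition and once to extract the capture group. A minimal sketch of resolving the project ref with a single match follows. The helper name `snapshotListProjectRef` and the constant `SNAPSHOT_LIST_PATH` are hypothetical, and the sketch models only the POST-style path because the pattern behind `snapshotListMatch` is not visible in this hunk.

```typescript
// Hypothetical helper for illustration only; it does not exist in the PR.
// Same pattern as the POST branch in the hunk above.
const SNAPSHOT_LIST_PATH = /^\/snapshots\/([^/]+)\/list$/;

// Returns the decoded project ref when the pathname is a snapshot-list path,
// or undefined when it is not. The regex runs once and the match is reused.
function snapshotListProjectRef(pathname: string): string | undefined {
  const match = SNAPSHOT_LIST_PATH.exec(pathname);
  if (!match) return undefined;
  return decodeURIComponent(match[1] ?? "");
}
```

With a helper of this shape, the condition and the `projectRef` extraction would share one result instead of re-running the regex.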
diff --git a/products/sandbox/sdk/tests/setup.ts b/products/sandbox/sdk/tests/setup.ts
index efde1b5..27d7fdb 100644
--- a/products/sandbox/sdk/tests/setup.ts
+++ b/products/sandbox/sdk/tests/setup.ts
@@ -99,9 +99,11 @@ beforeAll(async () => {
   try {
     const useLocalDriver = forceLocalDriver || !dockerAvailable;
     activeDriver = useLocalDriver ? "local" : "docker";
+    const enableLiteLLM =
+      !useLocalDriver && process.env.SANDBOX_E2E_USE_LITELLM === "true";
     const ctx = getSharedOrchestratorContext({
       useLocalDriver,
-      startLiteLLM: hasKeys && !useLocalDriver,
+      startLiteLLM: enableLiteLLM,
     });
     console.log("Starting orchestrator server...");
diff --git a/products/sandbox/sdk/tests/unit/errors.test.ts b/products/sandbox/sdk/tests/unit/errors.test.ts
new file mode 100644
index 0000000..b18dfab
--- /dev/null
+++ b/products/sandbox/sdk/tests/unit/errors.test.ts
@@ -0,0 +1,112 @@
+import { describe, expect, it } from "vitest";
+import { SandboxClient } from "../../src/client.js";
+import {
+  NetworkError,
+  parseErrorResponse,
+  QuotaError,
+} from "../../src/errors.js";
+
+describe("sandbox SDK error metadata", () => {
+  it("preserves sidecar headers and retry-after on rate limits", () => {
+    const headers = new Headers({
+      "retry-after": "45",
+      "x-sidecar-version": "1.2.3",
+      "x-sidecar-image": "sidecar:local@sha256:abc",
+    });
+
+    const err = parseErrorResponse(
+      429,
+      JSON.stringify({
+        message: "Too many requests",
+        quotaType: "terminal",
+        current: 51,
+        limit: 50,
+      }),
+      { method: "GET", path: "/fs/exists" },
+      headers,
+    );
+
+    expect(err).toBeInstanceOf(QuotaError);
+    expect(err.origin).toBe("sidecar");
+    expect(err.endpoint).toBe("/fs/exists");
+    expect(err.sidecarVersion).toBe("1.2.3");
+    expect(err.containerImage).toBe("sidecar:local@sha256:abc");
+    expect((err as QuotaError).retryAfterMs).toBe(45_000);
+  });
+
+  it("attaches endpoint metadata to network errors", () => {
+    const cause = new Error("connect ECONNREFUSED");
+    const err = new NetworkError(
+      "Failed to connect to Sandbox API: connect ECONNREFUSED",
+      cause,
+      { endpoint: "/v1/sandboxes", origin: "sandbox-api" },
+    );
+
+    expect(err.cause).toBe(cause);
+    expect(err.endpoint).toBe("/v1/sandboxes");
+    expect(err.origin).toBe("sandbox-api");
+  });
+
+  it("keeps metadata when the network cause is absent", () => {
+    const err = new NetworkError("socket closed", undefined, {
+      endpoint: "/projects/example",
+      origin: "orchestrator",
+    });
+
+    expect(err.endpoint).toBe("/projects/example");
+    expect(err.origin).toBe("orchestrator");
+  });
+
+  it("preserves explicit 501 error codes instead of collapsing to SERVER_ERROR", () => {
+    const err = parseErrorResponse(
+      501,
+      JSON.stringify({
+        code: "SNAPSHOT_SERVICE_UNAVAILABLE",
+        message: "Snapshots are unavailable for this backend",
+      }),
+      { method: "POST", path: "/v1/sandboxes/demo/snapshots" },
+      new Headers(),
+    );
+
+    expect(err.code).toBe("SNAPSHOT_SERVICE_UNAVAILABLE");
+    expect(err.status).toBe(501);
+  });
+
+  it("classifies orchestrator paths outside /v1 as orchestrator errors", () => {
+    const err = parseErrorResponse(
+      502,
+      JSON.stringify({ message: "upstream unavailable" }),
+      { method: "POST", path: "/projects/demo" },
+      new Headers(),
+    );
+
+    expect(err.origin).toBe("orchestrator");
+  });
+
+  it("derives endpoint and origin from stamped response headers", () => {
+    const err = parseErrorResponse(
+      500,
+      JSON.stringify({ message: "boom" }),
+      undefined,
+      new Headers({
+        "x-tangle-request-path": "/v1/sandboxes/demo/files/read",
+      }),
+    );
+
+    expect(err.endpoint).toBe("/v1/sandboxes/demo/files/read");
+    expect(err.origin).toBe("sandbox-api");
+  });
+
+  it("uses the actual request path for network failures from the shared client", async () => {
+    const client = new SandboxClient({
+      apiKey: "test-key",
+      baseUrl: "http://127.0.0.1:1",
+      timeoutMs: 100,
+    });
+
+    await expect(client.fetch("/v1/sandboxes/demo")).rejects.toMatchObject({
+      endpoint: "/v1/sandboxes/demo",
+      origin: "sandbox-api",
+    });
+  });
+});
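Reviewer aid, not part of the diff: the first test in this file expects a `retry-after: 45` header to surface as `retryAfterMs === 45_000`. The SDK's actual parsing lives in `src/errors.ts` and is not shown in this diff, so the following is only a minimal sketch of the delta-seconds conversion that the assertion implies. The helper name `retryAfterToMs` is hypothetical, and the sketch deliberately ignores the HTTP-date form of `Retry-After`.

```typescript
// Hypothetical helper for illustration only; the SDK's real implementation is
// not visible in this diff. Converts a delta-seconds Retry-After header to ms.
function retryAfterToMs(headers: Headers): number | undefined {
  const raw = headers.get("retry-after"); // Headers lookups are case-insensitive
  if (raw === null) return undefined;
  const trimmed = raw.trim();
  if (trimmed === "") return undefined; // Number("") would otherwise yield 0
  const seconds = Number(trimmed);
  // HTTP-date values parse to NaN here and are treated as "no hint".
  if (!Number.isFinite(seconds) || seconds < 0) return undefined;
  return Math.round(seconds * 1000);
}
```

Whether the real parser also accepts HTTP-date values or clamps large delays cannot be determined from the test alone.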
warning: Codex's Linux sandbox uses bubblewrap and needs access to create user namespaces.
2026-04-23T08:38:53.409735Z ERROR codex_api::endpoint::responses_websocket: failed to connect to websocket: HTTP error: 401 Unauthorized, url: wss://api.openai.com/v1/responses
2026-04-23T08:38:53.792411Z ERROR codex_api::endpoint::responses_websocket: failed to connect to websocket: HTTP error: 401 Unauthorized, url: wss://api.openai.com/v1/responses
2026-04-23T08:38:54.296828Z ERROR codex_api::endpoint::responses_websocket: failed to connect to websocket: HTTP error: 401 Unauthorized, url: wss://api.openai.com/v1/responses
ERROR: Reconnecting... 2/5
2026-04-23T08:38:55.001407Z ERROR codex_api::endpoint::responses_websocket: failed to connect to websocket: HTTP error: 401 Unauthorized, url: wss://api.openai.com/v1/responses
ERROR: Reconnecting... 3/5
2026-04-23T08:38:56.091168Z ERROR codex_api::endpoint::responses_websocket: failed to connect to websocket: HTTP error: 401 Unauthorized, url: wss://api.openai.com/v1/responses
ERROR: Reconnecting... 4/5
2026-04-23T08:38:58.214285Z ERROR codex_api::endpoint::responses_websocket: failed to connect to websocket: HTTP error: 401 Unauthorized, url: wss://api.openai.com/v1/responses
ERROR: Reconnecting... 5/5
2026-04-23T08:39:01.914344Z ERROR codex_api::endpoint::responses_websocket: failed to connect to websocket: HTTP error: 401 Unauthorized, url: wss://api.openai.com/v1/responses
ERROR: Reconnecting... 1/5
ERROR: Reconnecting... 2/5
ERROR: Reconnecting... 3/5
ERROR: Reconnecting... 4/5
ERROR: Reconnecting... 5/5
ERROR: unexpected status 401 Unauthorized: Missing bearer or basic authentication in header, url: https://api.openai.com/v1/responses, cf-ray: 9f0b8e5bcfc31e1a-ORD, request id: req_05e5f5dd01484e9b8beef11d23173f4a
ERROR: unexpected status 401 Unauthorized: Missing bearer or basic authentication in header, url: https://api.openai.com/v1/responses, cf-ray: 9f0b8e5bcfc31e1a-ORD, request id: req_05e5f5dd01484e9b8beef11d23173f4a