RFC: Rethinking Sandbox lifecycles in @codesandbox/sdk

1. History of Hibernation and Persistence

At CodeSandbox, we built a product where users could treat their sandboxes like cloud-based laptops. When a user stepped away from a project, the sandbox would automatically hibernate after a period of inactivity. Upon resuming, the sandbox would restore to its exact previous state — both in memory and persistence — almost instantly.

Given that most sandboxes were small, short-lived projects, we introduced an automatic archiving mechanism. After 7 days of inactivity, a sandbox would be archived. This system allowed us to manage persistence without requiring user intervention. It was opinionated, reliable, and tailored to a single use case that worked well at our scale.

Additionally, the CodeSandbox product introduced a feature called Live Forking. This allowed users to fork a running sandbox with the new sandbox sharing memory from the original. This enabled seamless flows such as starting in a read-only, always-up-to-date main branch and then moving into a writable sandbox branch without any interruptions.

This approach provided:

A simple mental model for users: "My sandbox is always where I left it."
Cost-effective resource management: unused environments were automatically hibernated or archived.
Minimal user configuration: persistence and storage were handled behind the scenes.

2. What We Have Learned

When we pivoted the CodeSandbox product toward the SDK, the number of use cases expanded dramatically, and many of them were not accounted for in our original stack or assumptions. This shift introduced challenges in both scalability and adaptability.

The core friction in the current SDK stems from our opinionated approach to hibernation and persistence in the CodeSandbox product.

One major issue is automatic hibernation, which is controlled by a timeout that SDK users can configure. However, this timeout is only extended by specific protocol messages sent from the SDK client. Other forms of activity — such as direct HTTP requests or file system operations — do not affect the timeout. This design has proven confusing and brittle across different usage scenarios.

The timeout is also managed inside the sandbox, making it fragile. We've encountered multiple cases where this internal state drifted or failed, causing sandboxes to remain alive beyond the intended timeout — or to hibernate unexpectedly.

Additionally, while SDK users can configure sandboxes to wake on HTTP or WebSocket connections, those interactions do not reset the timeout. This has made keeping the VM alive a recurring source of frustration.

In response to scaling pressures, we recently shortened the automatic archive window to 4 days, with periods of 2 days during peak loads to stabilize clusters. While this helped improve system stability, it also introduced less predictable resume behavior. If a sandbox is archived, it will boot from a fresh state (CLEAN) instead of a resumed memory snapshot (RESUME).

Even though this distinction is observable via the bootupType, it adds complexity for SDK integrators and hurts user experience. SDK users now have to:

Detect which state the sandbox is starting from
Account for significantly different startup durations
Handle potential end-user confusion when a sandbox suddenly takes longer to load
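
To make that burden concrete, here is a minimal sketch of the branching an integration ends up writing. The `RESUME` and `CLEAN` values come from this RFC; the `planStartup` helper, the `StartupPlan` shape, and the timeout numbers are hypothetical and only illustrate the idea.

```typescript
// Boot types named in this RFC. Everything else in this block is illustrative.
type BootupType = "RESUME" | "CLEAN";

interface StartupPlan {
  // Whether in-memory work (for example starting a dev server) must be redone.
  needsSetup: boolean;
  // How long the integration waits before treating startup as failed (assumed values).
  timeoutMs: number;
  // What to tell the end user while the sandbox starts.
  userMessage: string;
}

// Hypothetical helper: maps the observed boot type to integration behaviour.
function planStartup(bootupType: BootupType): StartupPlan {
  if (bootupType === "RESUME") {
    // A memory snapshot was restored, so in-memory state is back as it was.
    return {
      needsSetup: false,
      timeoutMs: 10_000,
      userMessage: "Resuming your sandbox...",
    };
  }
  // CLEAN: the sandbox was archived and boots from a fresh state,
  // so nothing from the previous memory snapshot can be relied on.
  return {
    needsSetup: true,
    timeoutMs: 90_000,
    userMessage: "Restoring your sandbox, this can take up to a minute...",
  };
}
```

Every integration has to carry some version of this branch today, which is the complexity we want to remove.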

Finally, the archive-recovery path increases the likelihood of edge-case failures, making integrations more error-prone and harder to support.

The Live Forking feature also introduced a significant scalability challenge. In some scenarios, thousands of sandboxes would simultaneously read from the memory of a single origin sandbox. This led to serious system bottlenecks and degraded performance across the platform.

As a final note, we’ve learned that SDK users — quite understandably — have done whatever they could to make our system work for their products. However, the wide range of new use cases has proven incompatible with our current hibernation, persistence, and forking behaviors. This fundamental mismatch is a key reason we've encountered so many reliability and scalability issues.

3. What We Want to Do

In short, we want to do three things:

1. Remove the current hibernation timeout and automatic wakeup

We know that SDK users intuitively prefer a timeout mechanism. To be clear: we are not saying timeouts will never exist, but the current implementation is inconsistent and too fragile. Please reach out to us if you still find timeouts to be the best option for your product, but also weigh the following considerations, which led us to treat active lifecycle management as the best practice:

Should all requests to the sandbox extend the timeout, or only some?
Should any internal process in the sandbox be able to extend the timeout?
What if the timeout fails to trigger hibernation — where does the error go, and what should happen?
Timeouts are a tradeoff between cost and UX. Shorter timeouts reduce cost but negatively impact UX, since hibernating adds latency to the resume. Longer timeouts reduce UX friction but raise costs. This is very difficult to control.
If you do not control the state of the Sandbox, you are forced to resume or query our platform before any interaction, which adds latency.

All of these challenges are solved if SDK users use their business logic to explicitly hibernate.
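
As a rough sketch of what explicit lifecycle management could look like in your own backend: the `SandboxLifecycle` interface below is an assumption made for this example. Its `resume` mirrors `sdk.sandboxes.resume` used later in this RFC, while `hibernate` is assumed to be a counterpart with the same shape. `SessionLifecycleManager` and its method names are hypothetical.

```typescript
// Minimal surface this sketch depends on (assumed, see the note above).
interface SandboxLifecycle {
  resume(id: string): Promise<unknown>;
  hibernate(id: string): Promise<unknown>;
}

type KnownState = "running" | "hibernated";

// Hypothetical helper that ties sandbox state to your product's own events.
class SessionLifecycleManager {
  private readonly lifecycle: SandboxLifecycle;
  private readonly states = new Map<string, KnownState>();

  constructor(lifecycle: SandboxLifecycle) {
    this.lifecycle = lifecycle;
  }

  // Call when your product knows the sandbox is needed, e.g. a user opens a project.
  async onSessionStart(sandboxId: string): Promise<void> {
    // The state is tracked locally, so no platform round trip is needed here.
    if (this.states.get(sandboxId) === "running") return;
    await this.lifecycle.resume(sandboxId);
    this.states.set(sandboxId, "running");
  }

  // Call when your product knows the sandbox is no longer needed.
  async onSessionEnd(sandboxId: string): Promise<void> {
    if (this.states.get(sandboxId) !== "running") return;
    await this.lifecycle.hibernate(sandboxId);
    this.states.set(sandboxId, "hibernated");
  }

  stateOf(sandboxId: string): KnownState | undefined {
    return this.states.get(sandboxId);
  }
}
```

The sketch keeps state in memory and does not deduplicate concurrent calls; a real integration would likely store the state alongside its own session records.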

2. Use a REST-based Sandbox Agent

The current SDK client requires a WebSocket connection, which complicates the interface and forces you to manage that connection across different environments. A REST-based Sandbox Agent simplifies both the mental model and the interface, and leaves no connection to manage.
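
To illustrate why a request/response model is simpler, here is a sketch of a command run expressed as a single self-contained HTTP request. The `/commands` path, the bearer-token header, and the JSON payload are placeholders invented for this example and are not the Sandbox Agent's actual API.

```typescript
interface AgentRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Hypothetical request builder. Path, auth scheme and payload are placeholders.
function buildRunCommandRequest(
  agentBaseUrl: string,
  token: string,
  command: string
): AgentRequest {
  // Strip trailing slashes so the joined URL has exactly one separator.
  const base = agentBaseUrl.replace(/\/+$/, "");
  return {
    url: `${base}/commands`,
    method: "POST",
    headers: {
      authorization: `Bearer ${token}`,
      "content-type": "application/json",
    },
    body: JSON.stringify({ command }),
  };
}

// Usage (not executed here): every call stands alone, with no socket to open,
// keep alive, or reconnect.
//   const req = buildRunCommandRequest(baseUrl, token, 'echo "hello world"');
//   const res = await fetch(req.url, req);
```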

3. Introduce long-term persistence

When a Sandbox is hibernated, we create a snapshot. If the snapshot is not resumed within 2-7 days, depending on the health of the cluster, we archive the Sandbox. This makes resume unpredictable: it normally takes 1-3 seconds, but can take up to 60 seconds when the Sandbox is archived. With long-term snapshot persistence we aim to stop archiving Sandboxes altogether, giving you predictable resume times of 1-3 seconds.

4. The Next Steps

SDK v2.3.0

Our first step is to release a "Best Practices" SDK. This is a NON-BREAKING change.

Add a new delete method
Make user creation (the id option) optional in connect and createSession

In this release we will also publish our updated "Best Practices" docs. Our goal for this release is to gather feedback and reach a conclusion on active lifecycle management versus timeouts.

Long Term Persistence

Our second step is to introduce a new persistence mechanism. As mentioned, this will give you predictable Sandbox resume times. As part of this release, new monitoring tools will become available in the dashboard. Long Term Persistence is also required to use our new Sandbox Agent.

SDK v2.4.0

Our third step is to deploy our new REST-based Sandbox Agent. This will also be a NON-BREAKING change. The following deprecations will happen:

sandbox.createSession() — the new Sandbox Agent only has a single root user
sandbox.connect() — the new Sandbox Agent is REST-only
connectToSandbox() — with a REST based Sandbox Agent it is safer and simpler to make sandbox requests through your own server

We are now able to expose the SandboxClient interface directly on the sandbox.

let sandbox = await sdk.sandboxes.resume('some-sandbox-id')

// You can still do this, but it is considered deprecated
const client = await sandbox.connect()
await client.commands.run('echo "hello world"')

// Lean on your existing agent update check to ensure running on new sandbox agent
sandbox = await (sandbox.isUpToDate ? sandbox : sdk.sandboxes.restart(sandbox.id))

// No reason to connect anymore, just start interacting
await sandbox.commands.run('echo "hello world"')

// Configure git and env variables with
sandbox.configureEnv({
  git: {},
  env: {}
})

SDK v3.0.0

Our fourth step is to deploy a new infrastructure where only the new Sandbox Agent runs. This concludes the effort to provide a first class infrastructure designed for the SDK use cases.

Feedback & Collaboration

We want to take this opportunity to thank our users for sticking with us as we’ve explored and adapted to a wide variety of new use cases. One of the most important things we’ve learned is that you want to manage sandboxes like a low-level, simple resource you have full control over — not like the high-level “laptop behavior” we originally designed for the CodeSandbox product.

Your feedback has been invaluable in shaping the direction of the SDK.

As we move forward with this new model, we invite you to reach out with:

Comments or concerns
Specific use cases you’d like to discuss
Invitations to feedback sessions or implementation discussions

We’re committed to making this transition smooth and to giving you the tools and flexibility you need.

With love, The CodeSandbox SDK Team ❤️
