@christianalfoni
Last active August 8, 2025 09:56

RFC: Rethinking Hibernation and Persistence in @codesandbox/sdk

1. History of Hibernation and Persistence

At CodeSandbox, we built a product where users could treat their sandboxes like cloud-based laptops. When a user stepped away from a project, the sandbox would automatically hibernate after a period of inactivity. Upon resuming, the sandbox would restore to its exact previous state — both in memory and persistence — almost instantly.

Given that most sandboxes were small, short-lived projects, we introduced an automatic archiving mechanism. After 7 days of inactivity, a sandbox would be archived. This system allowed us to manage persistence without requiring user intervention. It was opinionated, reliable, and tailored to a single use case that worked well at our scale.

Additionally, the CodeSandbox product introduced a feature called Live Forking. This allowed users to instantly fork a running sandbox they were viewing, with the new sandbox sharing memory from the original. This enabled seamless flows such as starting in a read-only, always-up-to-date main branch and then moving into a writable sandbox branch without any interruptions.

This approach provided:

  • A simple mental model for users: "My sandbox is always where I left it."
  • Cost-effective resource management: unused environments were automatically hibernated or archived.
  • Minimal user configuration: persistence and storage were handled behind the scenes.

2. What We Have Learned

When we pivoted the CodeSandbox product toward the SDK, the number of use cases expanded dramatically, and many of them were not accounted for in our original stack or assumptions. This shift introduced challenges in scalability and adaptability, but it also surfaced valuable insights.

The core friction in the current SDK stems from our opinionated approach to hibernation and persistence.

One major issue is automatic hibernation, which is controlled by a timeout that SDK users can configure. However, this timeout is only extended by specific protocol messages sent from the SDK client. Other forms of activity — such as direct HTTP requests or file system operations — do not affect the timeout. This design has proven confusing and brittle across different usage scenarios.

The timeout is also managed inside the sandbox, making it fragile. We've encountered multiple cases where this internal state drifted or failed, causing sandboxes to remain alive beyond the intended timeout — or to hibernate unexpectedly.

Additionally, while SDK users can configure sandboxes to wake on HTTP or WebSocket connections, those interactions do not reset the timeout. This has led to frustration both in keeping the VM alive when needed and in preventing it from lingering when it shouldn’t.

In response to scaling pressures, we recently shortened the automatic archive window to 4 days. While this helped improve system stability, it also introduced less predictable restore behavior. If a sandbox is archived, it will boot from a fresh state (CLEAN) instead of a resumed memory snapshot (RESUME).

Even though this distinction is observable via the bootupType, it adds complexity for SDK integrators. They now have to:

  • Detect which state the sandbox is starting from
  • Account for significantly different startup durations
  • Handle possible end-user confusion, such as “Why does my sandbox suddenly not seem to load?”

Finally, the archive-recovery path increases the likelihood of edge-case failures, making integrations more error-prone and harder to support.
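
To make the integrator's burden concrete, here is a minimal sketch of the branching that the two boot paths force on calling code. Only the CLEAN and RESUME values come from this RFC; the `StartupPlan` shape, the `planStartup` helper, and the timeout numbers are hypothetical and chosen purely for illustration.

```typescript
// Only "CLEAN" and "RESUME" are taken from the RFC text; the SDK may
// report other values that this sketch does not cover.
type BootupType = "CLEAN" | "RESUME";

// Hypothetical description of how an integrator reacts to a boot.
interface StartupPlan {
  rerunSetup: boolean; // re-run install/start commands after a fresh boot
  showSlowStartNotice: boolean; // tell the end user that loading takes longer
  timeoutMs: number; // how long to wait before treating startup as failed
}

function planStartup(bootupType: BootupType): StartupPlan {
  if (bootupType === "RESUME") {
    // Memory snapshot restored: processes are already running.
    return { rerunSetup: false, showSlowStartNotice: false, timeoutMs: 10_000 };
  }
  // CLEAN boot: in-memory state is gone, so setup has to run again.
  // The timeout values are arbitrary assumptions, not measured numbers.
  return { rerunSetup: true, showSlowStartNotice: true, timeoutMs: 120_000 };
}
```

Every integration has to carry some version of this branch today, which is part of what the proposal below aims to remove.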

The Live Forking feature also introduced a significant scalability challenge. In some scenarios, thousands of sandboxes would simultaneously read from the memory of a single origin sandbox. This led to serious system bottlenecks and degraded performance across the platform.

As a final note, we’ve learned that SDK users — quite understandably — have done whatever they could to make our system work for their products. However, the wide range of new use cases has proven incompatible with our current hibernation, persistence, and forking behaviors. This fundamental mismatch is a key reason we've encountered so many reliability and scalability issues.

3. What We Want to Do

In short, we want to do two things:

  1. Remove the hibernation timeout
  2. Remove archiving of sandboxes

What this means in practice is that SDK users will gain complete control over the sandbox lifecycle.

By default, sandboxes will persist and run indefinitely. It will be entirely up to the SDK consumer to decide when to hibernate or delete them, based on their own use case and business logic.

This change brings the following benefits:

  • Resuming a sandbox will always be a true resume, leading to predictable and reliable startup times
  • Business logic can connect to the sandbox before hibernation to modify internal state or extract data
  • Different use cases can tailor persistence behavior: long-lived for some, short-lived for others
  • The overall infrastructure becomes significantly simpler and more robust
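
As a rough illustration of running business logic before hibernation, the sketch below exports a file from a sandbox and only then hibernates it. The `SandboxHandle` interface and its `readFile` and `hibernate` methods are hypothetical stand-ins, not the actual @codesandbox/sdk API, and an in-memory fake keeps the example self-contained.

```typescript
// Hypothetical stand-in for an SDK sandbox handle. These method names are
// assumptions for illustration, not the real @codesandbox/sdk interface.
interface SandboxHandle {
  id: string;
  readFile(path: string): Promise<string>;
  hibernate(): Promise<void>;
}

// Extract data from the sandbox first, then hibernate it.
async function hibernateWithExport(
  sandbox: SandboxHandle,
  path: string,
  save: (id: string, contents: string) => Promise<void>,
): Promise<void> {
  const contents = await sandbox.readFile(path);
  await save(sandbox.id, contents);
  // Hibernate only after the export has succeeded.
  await sandbox.hibernate();
}

// In-memory fake so the sketch runs without network access.
function createFakeSandbox(id: string, files: Record<string, string>) {
  const state = { hibernated: false };
  const sandbox: SandboxHandle = {
    id,
    async readFile(path: string): Promise<string> {
      if (state.hibernated) throw new Error("sandbox is hibernated");
      const contents = files[path];
      if (contents === undefined) throw new Error(`no such file: ${path}`);
      return contents;
    },
    async hibernate(): Promise<void> {
      state.hibernated = true;
    },
  };
  return { sandbox, state };
}
```

Because the caller decides when hibernation happens, the export step cannot race against an internal timeout, which is the point of moving lifecycle control out of the sandbox.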

In addition to this, we will replace the existing Sandbox Agent with a new implementation purpose-built for SDK usage:

  • It will be REST-based, using WebSockets only for live terminal output
  • It will eliminate global users and sessions — there will be a single root user per sandbox
  • It will be fully decoupled from the core CodeSandbox product, allowing us to iterate, debug, and evolve the SDK experience independently

4. How to Integrate

This new type of sandbox will integrate seamlessly with the existing SDK interface. The only change will be the addition of a new delete method in the SDK. Other than that, all existing interactions remain the same.

However, these sandboxes will no longer hibernate automatically. Also, to ensure sustainable resource usage, we will need to introduce a persistence cost for sandboxes. This cost encourages SDK users to manage their sandbox lifecycles explicitly by deleting sandboxes they no longer need, which contributes to the stability of our infrastructure.

We intend to roll out this new sandbox type behind a feature flag in the near future. In the meantime, we encourage SDK users to evaluate how manual hibernation and persistence control could work for their specific use cases.

You can already begin adapting to this model by:

  • Setting the hibernation timeout to 1 day and manually calling hibernate() when appropriate
  • Choosing to delete sandboxes that have been inactive for 2–3 days. If a user returns, you can fork a fresh sandbox from your template and reconfigure it as needed

These practices will make transitioning to this planned new model much easier when it becomes available.
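
One possible way to structure this on the integrator side is a small policy function that maps idle time to an action. This is a sketch under stated assumptions: `decideAction`, `LifecyclePolicy`, and the 30-minute hibernation threshold are hypothetical, and the 3-day deletion threshold simply follows the 2-3 day suggestion above. The `lastActiveAt` timestamp is assumed to be tracked by your own backend, so it can count HTTP requests or file operations that the current built-in timeout ignores.

```typescript
type LifecycleAction = "keep" | "hibernate" | "delete";

interface LifecyclePolicy {
  hibernateAfterMs: number; // idle time before calling hibernate()
  deleteAfterMs: number; // idle time before deleting the sandbox
}

const HOUR = 60 * 60 * 1000;
const DAY = 24 * HOUR;

// Illustrative values only: 3 days follows the "2-3 days" suggestion,
// and the 30-minute hibernation threshold is an arbitrary assumption.
const defaultPolicy: LifecyclePolicy = {
  hibernateAfterMs: 30 * 60 * 1000,
  deleteAfterMs: 3 * DAY,
};

// Pure decision function: timestamps are milliseconds since the epoch.
function decideAction(
  lastActiveAt: number,
  now: number,
  policy: LifecyclePolicy = defaultPolicy,
): LifecycleAction {
  const idleMs = now - lastActiveAt;
  if (idleMs >= policy.deleteAfterMs) return "delete";
  if (idleMs >= policy.hibernateAfterMs) return "hibernate";
  return "keep";
}
```

A periodic job could call `decideAction` for each sandbox and then invoke hibernate() or, once it is available, the new delete method. If a user returns after their sandbox was deleted, fork a fresh sandbox from your template as described above.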

5. Feedback & Collaboration

We want to take this opportunity to thank our users for sticking with us as we’ve explored and adapted to a wide variety of new use cases. One of the most important things we’ve learned is that you want to manage sandboxes like a low-level, simple resource you have full control over — not like the high-level "laptop behavior" we originally designed for the CodeSandbox product.

Your feedback has been invaluable in shaping the direction of the SDK.

As we move forward with this new model, we invite you to reach out with:

  • Comments or concerns
  • Specific use cases you'd like to discuss
  • Invitations to feedback sessions or implementation discussions

We're committed to making this transition smooth and to giving you the tools and flexibility you need.

With love, The CodeSandbox SDK Team ❤️
