@pesterhazy
Last active December 7, 2024 22:13
Websocket Deployments — Enemy of the State


How do you deploy new code continuously into a system that needs to support stateful, uninterruptible computations?

Some workloads are inherently stateful

As web developers, we're used to valuing stateless computation, for good reason – reducing state makes everything easier. Servers (or cloud instances) hold no essential state, delegating any important data to databases like Postgres and Redis. This is a great architecture. Among other benefits, it makes zero-downtime deployments a breeze: to deploy a new code version, you simply spin up a set of new instances with the new git sha, add them to the load balancer, and kill the old instances after draining them of traffic (perhaps after 60 s).
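The rolling-deploy recipe above can be sketched as a small orchestration function. Everything here is a sketch: the `launch`/`register`/`deregister`/`terminate` callables are hypothetical stand-ins for your cloud provider's API (e.g. boto3 calls), not a real SDK.

```python
import time
from typing import Callable, Sequence


def rolling_deploy(
    git_sha: str,
    launch: Callable[[str], Sequence[str]],       # spin up instances running git_sha
    register: Callable[[Sequence[str]], None],    # add them to the load balancer
    deregister: Callable[[Sequence[str]], None],  # stop routing traffic to old ones
    terminate: Callable[[Sequence[str]], None],   # kill the old instances
    old_instances: Sequence[str],
    drain_seconds: int = 60,
) -> Sequence[str]:
    """Zero-downtime deploy for a stateless service, as described above."""
    new_instances = launch(git_sha)
    register(new_instances)
    deregister(old_instances)  # load balancer stops sending new requests
    time.sleep(drain_seconds)  # give in-flight requests time to finish
    terminate(old_instances)
    return new_instances
```

This works precisely because nothing of value lives on the instances: any request that survives the drain window can be killed without losing state.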

However, some workloads don't fit this "short-duration request & response" mold. As an example, consider connecting OpenAI Realtime to Twilio's API, which requires you to:

  • accept a websocket connection from Twilio
  • listen to messages on the websocket, which contain bits of audio
  • after making some optional changes, relay the bits of audio on to OpenAI Realtime, again via a websocket
  • keep doing this for the duration of the call, which could last up to 30 minutes
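The steps above can be sketched in Python. The message shapes follow Twilio Media Streams (JSON frames with an `event` field and a base64 `media.payload`) and the OpenAI Realtime `input_audio_buffer.append` event as publicly documented, but treat the exact shapes as assumptions; the websocket objects are assumed to support async iteration and `send` (e.g. the third-party `websockets` package).

```python
import base64
import json
from typing import Optional


def twilio_media_to_openai(frame: str) -> Optional[str]:
    """Translate one Twilio Media Streams frame into an OpenAI Realtime
    input_audio_buffer.append event. Returns None for non-media frames
    (e.g. "start"/"stop" events)."""
    msg = json.loads(frame)
    if msg.get("event") != "media":
        return None
    payload = msg["media"]["payload"]  # base64-encoded audio chunk
    base64.b64decode(payload)          # validate before relaying
    return json.dumps({"type": "input_audio_buffer.append", "audio": payload})


async def relay(twilio_ws, openai_ws):
    """Pump audio from Twilio to OpenAI for the lifetime of the call."""
    async for frame in twilio_ws:
        out = twilio_media_to_openai(frame)
        if out is not None:
            await openai_ws.send(out)
```

The `relay` coroutine is exactly the long-lived, stateful computation the rest of this post is about: it must run uninterrupted for the whole call.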

Note that we're dealing with two pieces of essential state which can't be handed off to a database:

  1. An incoming websocket connection from Twilio
  2. An outgoing websocket connection to OpenAI

If we drop either of those connections due to a server restart, we inevitably drop the ongoing call and end up with a bad customer experience.

How do we square continuous deployment with stateful services?

We now have two desiderata:

  1. We want to be able to deploy as often as necessary. In a busy monolith, we might deploy new code 10 times per hour, or more.

  2. We need to keep a stateful process active for up to 30 minutes, and it cannot be interrupted.

How can we satisfy both at once? The problem, of course, is that when a new code version is deployed, the server is typically stopped and restarted, or a new cloud instance is created, which cannot inherit open TCP connections from the old instance. I think this is an interesting engineering challenge.

I'd like to hear from you

Like any engineer, when I hear a problem like this, my mind starts coming up with possible solutions. (Would AWS Lambda work for this use case? It might help with the incoming websocket, but I think it won't be able to maintain the outgoing websocket connection.) But instead of writing up my half-baked thoughts, I'm curious what you think. Let me know in the comments below!

@pesterhazy (Author)
On connection draining: I think this could be a good approach. I'm mostly familiar with AWS load balancers and the ECS service, where target groups expose a deregistration_delay.timeout_seconds attribute.

deregistration_delay.timeout_seconds
The amount of time for Elastic Load Balancing to wait before deregistering a target. The range is 0–3600 seconds. The default value is 300 seconds.

So the idea would be:

  • set a maximum number of minutes for a process (e.g. 30 min as maximum call length)
  • set the load balancer to continue relaying traffic on the TCP connection for up to that timeout

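With a 30-minute maximum call length, the attribute could be set accordingly. A minimal sketch with boto3 (the target group ARN is a placeholder; the 0–3600 second range comes from the AWS docs quoted above):

```python
def drain_timeout_seconds(max_call_minutes: int) -> int:
    """Clamp the desired drain window to the allowed 0-3600 second range."""
    return min(max_call_minutes * 60, 3600)


def set_deregistration_delay(target_group_arn: str, max_call_minutes: int = 30) -> None:
    """Set deregistration_delay.timeout_seconds on an ELB target group.
    Requires AWS credentials; boto3 is imported lazily so the helper
    above stays importable without it."""
    import boto3  # third-party AWS SDK

    elbv2 = boto3.client("elbv2")
    elbv2.modify_target_group_attributes(
        TargetGroupArn=target_group_arn,
        Attributes=[{
            "Key": "deregistration_delay.timeout_seconds",
            "Value": str(drain_timeout_seconds(max_call_minutes)),
        }],
    )
```

Note that 30 minutes lands at exactly half the 3600-second maximum, so there is some headroom if calls ever need to run longer.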