El Presidente and I had a working session to design a feature flag system for nexus-go to support gradual infrastructure rollouts, specifically migrating from Redis to Valkey. The challenge is that this application handles about 20M events in 10-second windows across 150 production hosts, so performance is absolutely critical.
Initially we took the obvious approach - on-demand DynamoDB lookups with local caching. Each feature flag check would hit the local cache first, then fall back to DynamoDB on a cache miss. This seemed reasonable until we did the math on the scale. At 2M events per second, even with 1-minute caching, cache misses for new features could generate millions of simultaneous DynamoDB calls. We briefly considered switching from app_id-based flags to app_stage-based flags (using the MUSTER_PROFILE env var) to reduce the lookup space, but we realized we'd still have the same rate limiting issues - every single event would still need to check the feature flag, just with a different key. We actual