@PickledDragon
Last active November 10, 2024 14:26

The payment system (owned by Team 0) failed intermittently between 05:00 and 09:00 CET.

  1. Why did the payment system fail?
  • A resource exhaustion event occurred on a multi-tenant, distributed database cluster.
  2. Why did the resource exhaustion occur?
  • Team 1 executed a data clean-up change.
  3. Why did Team 1 execute the data clean-up?
  • They were outside their service window, and their tests showed no problems in lower environments.
  4. Why did tests show no problems in lower environments?
  • Traffic in lower environments was not comparable to traffic in production.
  5. Why was there no comparable traffic in lower environments?
  • The acceptance environment is, by definition, production sans users.