Triage (PagerDuty rotation)
One person at a time is on call. We use PagerDuty to coordinate this, and automated or manual triggers for possible incidents go to their phone (day or night). It’s their job to decide what level of incident this is and, if needed, start the appropriate response. They should feel comfortable summoning anyone else to help triage or prioritize.
A feature is broken, or not working as expected. A data provider (contacted through pipeline) is down: create a ticket, talk to PMs in #triage, and communicate with support/customers about expected delays.
Create a ticket, regular workflow
A small subset of customers are unable to do checks.
This is handled right away during working hours. Outside of working hours, it can be left until the morning.
- Hemorrhaging Money/Data/Reputation
- The system (or a vertical of our system) is not usable by most of our customers.
Start an Incident Response (see below)
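The triage decision described above can be sketched as a small function. This is only an illustration of the decision order, not a tool we run: the `Report` fields, the `LOW` and `MEDIUM` labels, and the function names are assumptions made for the example (the doc itself only names HIGH explicitly).

```python
from dataclasses import dataclass
from enum import Enum


class BlastRadius(Enum):
    LOW = "low"        # assumed label, not named in this doc
    MEDIUM = "medium"  # assumed label, not named in this doc
    HIGH = "high"


@dataclass
class Report:
    """Hypothetical summary of what the on-call person knows so far."""
    hemorrhaging: bool = False            # losing money, data, or reputation
    most_customers_blocked: bool = False  # system (or a vertical) unusable by most customers
    some_customers_blocked: bool = False  # a small subset of customers cannot do checks
    within_working_hours: bool = True


def classify(report: Report) -> BlastRadius:
    """Pick the highest level whose criteria match."""
    if report.hemorrhaging or report.most_customers_blocked:
        return BlastRadius.HIGH
    if report.some_customers_blocked:
        return BlastRadius.MEDIUM
    # Broken feature, data provider down, and similar.
    return BlastRadius.LOW


def next_step(report: Report) -> str:
    """Map the level to the response described in this doc."""
    level = classify(report)
    if level is BlastRadius.HIGH:
        return "start an incident response"
    if level is BlastRadius.MEDIUM:
        if report.within_working_hours:
            return "handle right away"
        return "handle in the morning"
    return "create a ticket, regular workflow"
```

The checks run from most to least severe, so an incident that matches several criteria is always treated at the highest matching level.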
The rest of this doc is about responding to HIGH blast radius incidents.
At the start of the incident the person on call wears all hats. It is their responsibility to delegate these responsibilities and not get overwhelmed. Another point to delegate is when they need to rotate out and get some rest (don’t wear yourself out).
assign roles
make sure the right people are in the room
make sure the team is diagnosing and moving towards a solution
don’t do the work
coordinate the people
request more people when needed
rotate out people when they are tired
release people when they are not needed
coordinate the post mortem(s)
dev wide for review
company wide if needed for transparency
coordinate the tickets and the long-term work needed to avoid these issues in the future
periodic updates to the rest of the company when appropriate (every 15min is a good rule of thumb)
getting the appropriate expectations to CS and customers (the status page if/when we get one)
pin a convo in #fires with the current roles so people know
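As a rough sketch of two items from the list above (the pinned roles message and the 15-minute update cadence), something like the following could help. The role names and people in the test data are hypothetical, and nothing here talks to a chat tool; it only builds the text and computes when the next update is due.

```python
from datetime import datetime, timedelta

# Rule of thumb from the list above: update the company roughly every 15 minutes.
UPDATE_INTERVAL = timedelta(minutes=15)


def roles_message(roles: dict[str, str]) -> str:
    """Build the text to pin in #fires so people know who holds which role."""
    lines = ["Current incident roles:"]
    for role, person in roles.items():
        lines.append(f"- {role}: {person}")
    return "\n".join(lines)


def next_update_due(last_update: datetime) -> datetime:
    """Return when the next periodic update should go out."""
    return last_update + UPDATE_INTERVAL
```

Re-posting and re-pinning the message whenever someone rotates in or out keeps the pinned roles current.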
Usually Devs and DevOps folks.
figure out what is going wrong
figure out the bandaid solution (if necessary)
do bandaid solution
communicate long term fixes needed (during a post mortem)
Anyone who feels comfortable doing so. If you are new at this, ask for a shadow and feedback afterwards.
For anyone interested in joining the rotation, talk to your team lead and/or someone on rotation.
The process for training would be roughly:
onboarding discussion with someone from the rotation
adding you to the rotation when/if you feel ready
have a shadow (someone already on rotation) during your rotations until you feel comfortable going it alone.