Something is down! What should we do?
First: Breathe. It’s just tech. We’ll fix it.
Slow down. Hasty decisions won’t help us be operational again. In fact, they often lead to making more mistakes. Remember to "Go slow to go fast". We have time to think before we act.
The formal process described here helps us slow down and think. When possible, an oncall engineer should be paired with a more senior or experienced mentor engineer, and the oncall engineer should try to execute as much of this process as they can themselves. The pairing is primarily meant for mentors to support fellow engineers in their oncall and engineering journey, providing a backstop only when necessary.
- Discover a potential problem
  - Acknowledge any alerts
- Declare an incident
  - Create a communication channel for the incident
  - Create a new incident report to record the event
- Escalate and assign roles
  - Assign roles as necessary
  - Escalate to other teams if necessary
- Communicate and Act
  - Work the problem
  - Document the incident as you go
  - Communicate regularly with the rest of the company and users
- Resolve the incident
  - Affirm that the problem is resolved
  - Schedule an Incident Retrospective
Problems can be discovered by automated monitoring systems, by employees, and by customers. When a potential problem is found, report it to the oncall engineer and alert your global incident channel (if one exists).
Finding the oncall engineer
If you are not the oncall engineer and need to discuss the problem with them, [INSERT YOUR ONCALL LOOKUP PROCESS HERE].
If you determine the problem degrades service for customers or downstream users, declare it to be an incident.
If your company has a chat service (Slack, for example), create a new incident channel with the date in its name, such as incident-2020-02-02. Move all further discussion to that channel.
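As a minimal sketch of the naming convention, the channel name can be derived from the incident date. The function name `incident_channel_name` and the `prefix` parameter are assumptions made for illustration; creating the channel itself depends on your chat service and is not shown.

```python
from datetime import date


def incident_channel_name(day: date, prefix: str = "incident") -> str:
    """Build a channel name of the form incident-YYYY-MM-DD."""
    # date.isoformat() always yields a zero-padded YYYY-MM-DD string.
    return f"{prefix}-{day.isoformat()}"


print(incident_channel_name(date(2020, 2, 2)))  # incident-2020-02-02
```

If more than one incident can occur on the same day, a team might append a short suffix; that choice is left open here.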
Create a new incident report using your team's Incident Report template. This isn't just a report for reporting's sake; it's a constantly updated log of what happens during the incident, and it is critical for later Incident Retrospectives. An incident report should have pre-formatted fields ready to go for documenting participants, video call transcriptions or videos, links to chat channels, ongoing notes about the incident, and more. It will eventually become the base of your root cause analysis (RCA).
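One way to picture such a template is as a small record with the fields listed above plus a timestamped notes log. This is only a sketch: the class name `IncidentReport`, its field names, and the `add_note` helper are hypothetical, and a real team template will likely be a document or ticket rather than code.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentReport:
    """Hypothetical in-memory stand-in for an incident report template."""
    title: str
    participants: list[str] = field(default_factory=list)
    recordings: list[str] = field(default_factory=list)   # video or transcript links
    chat_links: list[str] = field(default_factory=list)   # incident channel links
    notes: list[tuple[datetime, str, str]] = field(default_factory=list)

    def add_note(self, author: str, text: str) -> None:
        """Append a timestamped note and record the author as a participant."""
        if author not in self.participants:
            self.participants.append(author)
        self.notes.append((datetime.now(timezone.utc), author, text))


report = IncidentReport(title="Checkout errors")
report.add_note("sue", "Restarted the worker pool")
```

Storing a UTC timestamp with every note keeps the report usable as a timeline during the retrospective.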
Every high-severity incident has three roles, each with its own responsibilities. By default the oncall engineer is responsible for performing all three roles (Control, Ops, and Comms) until the incident is resolved or the issue is deemed severe enough to involve more people. At that point, the oncall engineer must ensure each of the roles is handed off to another person.
If the oncall engineer is not familiar with the problem domain, they should escalate as soon as possible to a domain expert. Do try to troubleshoot and take action to the best of your ability, but don't delay escalation if you find yourself in a rabbit hole.
This role is performed by one person at a time. The incident controller manages the high-level state of the incident and makes key decisions about what actions to take. To do this effectively, the controller needs to have good information about the problem: help if you can by reporting useful information to the incident channel! Do not take any major actions (actions which would meaningfully change the state of our systems) without instruction from the controller. The controller will do a better job if they can, well, control the actions we take and when we take them. Do keep the controller informed about what has been done, and what the current status is.
This person might be a Manager or PM, but it's ok and encouraged for it to be an engineer as well. Engineers should take every opportunity to own their services. This person should NOT be a director or executive, but should be reporting directly to people in those roles on a regular cadence (perhaps every 20 to 30 minutes).
Responsibilities include:
- Setting the priority and/or severity level of the incident
- Escalating to other teams and engineers
- Keeping track of work being done by incident ops
- Triaging which actions to take
- Approving changes before they are made and obtaining change request approval if needed
This role is performed by one or more engineers. Incident Operators investigate and implement changes. They should be people with domain expertise, whatever the relevant domain is. If we need more than one person acting on instructions from the controller, do delegate those tasks formally, in the chat channel. When the task is completed, say so in the channel. We do this formally so everyone knows who is responsible for any one action, to keep communication overhead low.
Responsibilities include:
- Investigating technical details of the problem
- Providing suggested actions to the incident controller
- Implementing any changes approved by the incident controller
- Updating the incident timeline with any actions they take
This secretarial role is performed by one person at a time. The incident communicator keeps track of the incident's status and handles communication.
Responsibilities include:
- Documenting the incident as it happens by making and updating the incident report
- Starting any video meetings if required and recording them
- Summarizing events in meetings in the incident report
- Ensuring Incident Operators also document their actions
- Providing status updates to the rest of the company and to users if appropriate
Do assign these roles and stick with them. If a person needs to swap in or out of a role, declare it plainly and publicly in the incident channel.
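The role rules can be sketched as a small board that starts with the oncall engineer holding Control, Ops, and Comms, and records every swap as a message to post in the incident channel. The names `RoleBoard`, `ROLES`, and `hand_off` are assumptions for illustration. For brevity the sketch keeps one holder per role, even though Ops may be performed by several engineers.

```python
from dataclasses import dataclass, field

ROLES = ("control", "ops", "comms")


@dataclass
class RoleBoard:
    """Hypothetical tracker for who currently holds each incident role."""
    oncall: str
    holders: dict[str, str] = field(default_factory=dict)
    announcements: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # By default the oncall engineer performs all three roles.
        for role in ROLES:
            self.holders.setdefault(role, self.oncall)

    def hand_off(self, role: str, new_holder: str) -> str:
        """Move a role to a new person and return the message to post publicly."""
        if role not in ROLES:
            raise ValueError(f"unknown role: {role}")
        previous = self.holders[role]
        self.holders[role] = new_holder
        message = f"{new_holder} takes over {role} from {previous}"
        self.announcements.append(message)
        return message


board = RoleBoard(oncall="joe")
print(board.hand_off("comms", "sue"))
```

Returning the announcement text from `hand_off` is a design choice meant to make it hard to change a role without also saying so in the channel.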
First and foremost, work the problem! Identify the problem in clear language, then ask questions about what you are seeing. When did the problem begin? Has something changed? The answers help you develop a clear understanding of what is happening.
Keep communication in channel
If you discuss the problem with someone outside of your comms channel or meetings, report the results of your discussion to the incident channel, or the discussion did not happen as far as the record is concerned. The incident channel discussion also serves as a timestamped log we'll use for Incident Retrospectives.
Stay on point
Don’t post memes in the incident channel. Keep the discussion informative and relevant. Move longer discussions to threads. It helps to be more formal and polite. Explicitly acknowledge requests and responses. Acknowledgements help us know that we've been heard, so we don't have to repeat ourselves.
Work your role
If you have not been assigned an incident role, you’re welcome to watch quietly in the channel. If you have questions, address them to the communicator using another channel, or a private message. If you have important relevant information, do post it in the channel. Use a thread or talk to the communicator privately if you're not sure.
Practice affirmative confirmation
When accepting a role, exiting a role, or executing an action or a change for the controller, make sure that all parties involved have affirmed what is happening. This means when somebody says "I have done X", others acknowledge out loud that they know this was done ("You did X, got it"). This strategy prevents confusion, avoids missed updates or questions in a busy chat (lonely questions), and ensures that everyone is acting with the same understanding.
Joe: Hi everyone, I need to hand off incident control to the next oncall engineer. Sue, will you take over as incident controller?
Sue: Yes, as of now I'm the incident controller.
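The exchange above can be modeled as a two-step handoff in which the role does not move until the proposed person affirms it. This is a minimal sketch under that assumption; the `Handoff` class and its methods are hypothetical names, not part of any tool.

```python
from dataclasses import dataclass


@dataclass
class Handoff:
    """Hypothetical two-step role handoff: propose, then confirm."""
    role: str
    current: str
    proposed: str
    confirmed: bool = False

    def confirm(self, who: str) -> str:
        """Accept the role; only the person who was asked may confirm."""
        if who != self.proposed:
            raise PermissionError(f"{who} was not asked to take {self.role}")
        self.confirmed = True
        return f"{who}: Yes, as of now I'm the {self.role}."

    def holder(self) -> str:
        """Return who holds the role right now."""
        # Until the handoff is affirmed, the original holder keeps the role.
        return self.proposed if self.confirmed else self.current


handoff = Handoff(role="incident controller", current="Joe", proposed="Sue")
print(handoff.holder())        # Joe
print(handoff.confirm("Sue"))  # Sue: Yes, as of now I'm the incident controller.
print(handoff.holder())        # Sue
```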
When we have reached a point where the problem is fixed and the business is operating as expected, the incident controller must declare it to be resolved. This is not the end, however. The incident controller is responsible for scheduling an Incident Retrospective with all involved parties.
An incident is not closed until we’ve run an Incident Retrospective. We do this so we can learn from incidents and not repeat them. Our goal is to have fresh, new, inventive incidents: repeating mistakes is a sign we haven't learned from them. The Incident Report and the channel history are the documentation for what happened during an incident, and the foundation for our Retrospective. Do what you can to improve both.
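The difference between "resolved" and "closed" can be sketched as a small state machine in which closing is refused until a retrospective has been recorded. The `State` and `Incident` names and their methods are assumptions for illustration only.

```python
from enum import Enum


class State(Enum):
    OPEN = "open"
    RESOLVED = "resolved"
    CLOSED = "closed"


class Incident:
    """Hypothetical lifecycle: open -> resolved -> (retrospective) -> closed."""

    def __init__(self) -> None:
        self.state = State.OPEN
        self.retrospective_done = False

    def resolve(self) -> None:
        # Declared by the incident controller once the problem is fixed.
        if self.state is not State.OPEN:
            raise RuntimeError("only an open incident can be resolved")
        self.state = State.RESOLVED

    def record_retrospective(self) -> None:
        if self.state is not State.RESOLVED:
            raise RuntimeError("run the retrospective after resolution")
        self.retrospective_done = True

    def close(self) -> None:
        # Resolution alone is not enough; the retrospective must have run.
        if self.state is not State.RESOLVED or not self.retrospective_done:
            raise RuntimeError("incident cannot close before its retrospective")
        self.state = State.CLOSED


incident = Incident()
incident.resolve()
incident.record_retrospective()
incident.close()
print(incident.state.value)  # closed
```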
For smaller or less severe incidents, we won't invoke the full protocol described here, but we will keep its spirit in mind. The more serious it gets, the more formal we will become. The more serious an outage is, the deeper the temptation we'll feel to rush actions, or just do something to make it better. That's when we really do need to slow down and trust the process.
Remember: breathe. It’s okay to take time to think & consult. We’ll fix it.