This should be a short (4-5 sentences) blurb which succinctly describes the event. At the very least, it should include:
- the duration (with start & end times in the U.S. Pacific time zone):
…which lasted for roughly 10 minutes between 9:02 PM and 9:12 PM Pacific…
- the impact to our users:
…resulting in requests returning 5xx errors, peaking at 93% of requests during the event…
- the root cause of the incident:
After further investigation, our engineers have determined that a cascading failure in the datacenter’s battery backups caused instances to halt…
Use human-readable timestamps—e.g.:
Apr 29, 2016 9:02 AM Pacific
- NOTE: Make sure to include the time zone — we have remote folks in other time zones who will be reading this…
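One way to produce timestamps in this format from UTC log times is sketched below. It is a minimal illustration, not part of the template itself: the helper name `format_pacific` is hypothetical, and the sketch assumes Python 3.9+ with system time zone data available and the default English month abbreviations.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Assumption: "Pacific" means the America/Los_Angeles zone (handles PST/PDT).
PACIFIC = ZoneInfo("America/Los_Angeles")


def format_pacific(moment: datetime) -> str:
    """Render an aware datetime in the style 'Apr 29, 2016 9:02 AM Pacific'."""
    if moment.tzinfo is None:
        # Naive datetimes are ambiguous, so refuse them rather than guess.
        raise ValueError("expected a timezone-aware datetime")
    local = moment.astimezone(PACIFIC)
    hour12 = local.hour % 12 or 12  # 0 -> 12 for midnight, 13 -> 1, etc.
    meridiem = "AM" if local.hour < 12 else "PM"
    month = local.strftime("%b")  # locale-dependent abbreviation, e.g. "Apr"
    return f"{month} {local.day}, {local.year} {hour12}:{local.minute:02d} {meridiem} Pacific"


# Example: a log entry recorded at 16:02 UTC on April 29, 2016.
outage_start = datetime(2016, 4, 29, 16, 2, tzinfo=timezone.utc)
print(format_pacific(outage_start))
```

Converting from an aware UTC value, rather than hand-adjusting hours, keeps daylight saving transitions correct when an incident spans a clock change.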
Cover the entirety of the outage duration, including:
- when the outage began
- when staff were notified
- any actions, events, etc.
- when service was restored and the outage ended
Give as detailed an explanation of the event as you can. Do NOT sugarcoat—postmortems are always to be conducted in a 100% blame-free environment, so don’t be afraid to state the facts.
Give as detailed an explanation as you can of the actions taken (including human-readable timestamps) by both internal and external/third parties.
Create/maintain an itemized list of ways to prevent this outage from happening again. These items should ideally answer the question: “What can we do better next time?”