Postmortem: [Short descriptive title]

Date: YYYY-MM-DD Duration: [Time from detection to resolution] Severity: [S1: service down | S2: degraded | S3: minor | S4: cosmetic] Tags: [e.g., disk, docker, dns, networking, backup, certs]

Impact

[1–3 sentences. What broke, from the user's perspective? Even if "the user" is just you. What could you not do while this was happening?]

Timeline

Time (UTC)	Event
HH:MM	First sign of trouble (or estimated start)
HH:MM	Detection — how was it noticed?
HH:MM	Key debugging steps (what you tried, in order)
HH:MM	Root cause identified
HH:MM	Fix applied
HH:MM	Service confirmed restored

Root Cause

[2–5 sentences. The actual, technical reason this happened. Not "the server crashed" but WHY the server crashed.]

Contributing Factors

[Things that didn't directly cause the incident but made it worse or slower to detect/resolve. Missing monitoring, no runbook, stale docs, config drift, etc.]

Detection Gap

[How long was the service broken before you noticed? Why? What would have caught it sooner?]

Remediations

Action	Type	Owner	Deadline	Status
[Specific task]	prevent / detect / mitigate	you	YYYY-MM-DD	TODO
[Specific task]	prevent / detect / mitigate	you	YYYY-MM-DD	TODO

Lessons Learned

[What surprised you?]
[What went well during the response?]
[What would you do differently?]

cmbaughman/postmortem.md

Select an option

No results found