- Date: YYYY-MM-DD
- Duration: [Time from detection to resolution]
- Severity: [S1: service down | S2: degraded | S3: minor | S4: cosmetic]
- Tags: [e.g., disk, docker, dns, networking, backup, certs]

[1–3 sentences. What broke, from the user's perspective? Even if "the user" is just you. What could you not do while this was happening?]

| Time (UTC) | Event |
|---|---|
| HH:MM | First sign of trouble (or estimated start) |
| HH:MM | Detection — how was it noticed? |
| HH:MM | Key debugging steps (what you tried, in order) |
| HH:MM | Root cause identified |
| HH:MM | Fix applied |
| HH:MM | Service confirmed restored |

[2–5 sentences. The actual, technical reason this happened. Not "the server crashed" but WHY the server crashed.]

- [Things that didn't directly cause the incident but made it worse or slower to detect/resolve. Missing monitoring, no runbook, stale docs, config drift, etc.]

[How long was the service broken before you noticed? Why? What would have caught it sooner?]

| Action | Type | Owner | Deadline | Status |
|---|---|---|---|---|
| [Specific task] | prevent / detect / mitigate | you | YYYY-MM-DD | TODO |
| [Specific task] | prevent / detect / mitigate | you | YYYY-MM-DD | TODO |

- [What surprised you?]
- [What went well during the response?]
- [What would you do differently?]