Created
September 9, 2015 19:28
# INCIDENT DATE - INCIDENT TYPE
## Meeting
TBD
#### Start every PM stating the following
1. This is a blameless Post Mortem.
2. We will not focus on the past events as they pertain to "could've", "should've", etc.
3. All follow up action items will be assigned to a team/individual before the end of the meeting. If the item is not going to be top priority leaving the meeting, don't make it a follow up item.
### Incident Leader: Jon Anderson
## Description
Delivery jobs from delivery.chef.co were found to be failing on all builders, impacting work.
## Timeline
This incident began at 4:58 AM UTC on Wednesday, September 8, 2015. It was resolved at 7:17 AM UTC the same day.
Time to detect: 97 minutes, 4:58 AM - 6:35 AM
Time to resolve: 42 minutes, 6:35 AM - 7:17 AM
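The two durations above can be re-derived from the three clock times they quote. The following is a minimal sketch for checking the arithmetic; the variable and function names are illustrative and not part of any incident tooling, and it assumes all three times fall on the same UTC day, as stated above.

```python
from datetime import datetime

# Clock times (UTC) quoted in the timeline summary. Only the time of day
# is used, on the assumption that all three fall on the same day.
FMT = "%H:%M"
reported = datetime.strptime("04:58", FMT)     # first report of failing jobs
cause_found = datetime.strptime("06:35", FMT)  # cause identified
all_clear = datetime.strptime("07:17", FMT)    # incident resolved


def minutes_between(start, end):
    """Return the whole number of minutes from start to end."""
    return int((end - start).total_seconds() // 60)


time_to_detect = minutes_between(reported, cause_found)
time_to_resolve = minutes_between(cause_found, all_clear)
total_duration = minutes_between(reported, all_clear)

print(f"Time to detect:  {time_to_detect} minutes")
print(f"Time to resolve: {time_to_resolve} minutes")
print(f"Total duration:  {total_duration} minutes")
```

Summing the two figures gives the total incident duration, which is the number the Impact section asks for.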
* 2015-06-05T04:58:00+0000: @cwebber reports jobs failing on delivery.shd.chef.co
* 2015-06-05T05:15:00+0000: @jonanderson declares an incident in slack#delivery-support and moves to #delivery-incident, is IC
* 2015-06-05T05:26:00+0000: @cwebber describes the server upgrade process, which involved changing the underlying mount points of the delivery FS with soft links. Previous disk space issues are also reported
* 2015-06-05T05:36:00+0000: @jonanderson SSHes to the server; observes an unrelated stack trace in the delivery log
* 2015-06-05T05:38:00+0000: @jonanderson re-runs jobs; same result.
* 2015-06-05T05:38:00+0000: @dmmcown verifies SHD and delivery.chef.co are running the same version of pushy
* 2015-06-05T05:45:00+0000: @afiune suggests validating git repo mount points
* 2015-06-05T05:47:00+0000: @jonanderson starts group zoom; delivery support group 3 all present, with @sf, @oliver, @cwebber
* 2015-06-05T05:57:00+0000: @cnunciato confirms SHD jobs are working fine
* 2015-06-05T06:14:00+0000: @dmccown opens a PagerDuty incident; pages go out to everyone in group 3
* 2015-06-05T06:26:00+0000: @tom notices clock skew between the delivery.chef.co chef server and builders
* 2015-06-05T06:33:00+0000: @cwebber runs ntop on all builders and chef server; job tested but still fails
* 2015-06-05T06:35:00+0000: Group debugging and discussion reveals that a recent change to delivery_build has left the push_jobs whitelist regex in a broken state (a \u character in the delivery-cmd arg instead of '\1'). Also, version 0.3.261 of delivery has vendored a version of pushy with debug enabled in the logs, making it difficult to tell whether jobs were reaching the builder. Team decides this should also be fixed and 0.3.261 pulled.
* 2015-06-05T06:49:00+0000: @jonanderson posts a fix plan: fix the whitelist bug in delivery_build; fix delivery to revendor the pushy cookbook at 2.4.2; @cwebber will pull down that version to fix the debug problem preventing logging on the builders.
* 2015-06-05T06:52:00+0000: @jmorrow pushes delivery_build CR to fix the whitelist
* 2015-06-05T07:24:00+0000: @jmorrow notifies @afiune that the build is bad, to avoid a customer install at Southwest Airlines.
* 2015-06-05T07:08:00+0000: @schisamo confirms 0.3.261 has been yanked from package cloud
* 2015-06-05T07:11:00+0000: @megs confirms she will notify customers of the bad build (GE and Sun Corp)
* 2015-06-05T07:17:00+0000: @cwebber confirms that, after the delivery_build fix and CCRs on delivery.chef.co, everything is working
* 2015-06-05T07:17:00+0000: @jonanderson posts the all-clear for the incident.
## Contributing Factor(s)
There were two main contributing factors to this incident:
* The delivery_build change produced a push_jobs whitelist on the chef server that could not run the delivery-cmd. Because the delivery.chef.co delivery systems had been upgraded, the CCR was run on the builders, and as a result jobs could not run.
* Additionally, vendoring the 2.4.1 pushy cookbook in delivery build 0.3.261 caused pushy to run with debug enabled. This wiped out all the logs on the builders, so the errors could not be seen.
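The first factor comes down to a rendered whitelist command containing a stray escape (\u) where a backreference ('\1') was expected. A check along the following lines could flag such an entry before it reaches the builders. This is only a sketch under assumptions: the whitelist is modelled as a plain mapping of job name to command string, the entries shown are invented for illustration, and `find_stray_escapes` is a hypothetical helper, not part of delivery_build or push_jobs.

```python
import re

# Assumption: a rendered whitelist command should never contain a backslash
# followed by a letter (such as "\u"). Numeric backreferences like "\1" are
# treated as legitimate and are not flagged.
STRAY_ESCAPE = re.compile(r"\\[A-Za-z]")


def find_stray_escapes(whitelist):
    """Return {job_name: [escapes]} for commands containing stray escapes."""
    problems = {}
    for name, command in whitelist.items():
        found = STRAY_ESCAPE.findall(command)
        if found:
            problems[name] = found
    return problems


# Hypothetical entries for illustration only; the real whitelist differs.
expected_whitelist = {"delivery-cmd": r"delivery-cmd '\1'"}
broken_whitelist = {"delivery-cmd": r"delivery-cmd '\u'"}

print(find_stray_escapes(expected_whitelist))
print(find_stray_escapes(broken_whitelist))
```

The first call reports nothing, while the second reports the delivery-cmd entry. The pattern is deliberately narrow; a real check would need to follow whatever escaping rules the whitelist actually uses.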
## Stabilization Steps
What specific steps and actions were taken to stabilize the issue. This
does not always entail a "fix" as further actions should be listed under
"corrective actions"
## Impact
What was the impact of the incident. This should include the total
duration of the outage if applicable.
## Corrective Actions
Action items going forward to fix the issue and reduce chance of contributing factors being an issue.
This **MUST** include owners/teams assigned to these actions to see them through, and have an issue tracked in this repository (or otherwise linked to external team kanban/issue tracker).