# INCIDENT DATE - INCIDENT TYPE
## Meeting
TBD
#### Start every PM by stating the following
1. This is a blameless post-mortem.
2. We will not focus on past events in terms of "could've", "should've", etc.
3. All follow-up action items will be assigned to a team/individual before the end of the meeting. If an item is not going to be a top priority leaving the meeting, don't make it a follow-up item.
### Incident Leader: Jon Anderson
## Description
It was discovered that delivery jobs from delivery.chef.co were failing on all builders, blocking dependent work.
## Timeline
This incident began at 4:58 AM UTC on Wednesday, September 8, 2015. It was resolved at 7:17 AM UTC the same day.
* Time to detect: 97 minutes (4:58 AM - 6:35 AM UTC)
* Time to resolve: 42 minutes (6:35 AM - 7:17 AM UTC)
* 2015-06-05T04:58:00+0000: @cwebber reports jobs failing on delivery.shd.chef.co
* 2015-06-05T05:15:00+0000: @jonanderson declares an incident in Slack #delivery-support and moves it to #delivery-incident; he is IC
* 2015-06-05T05:26:00+0000: @cwebber relates details of the server upgrade process, which involved changing the underlying mount points of the delivery FS via soft links. Previous disk space issues are also reported
* 2015-06-05T05:36:00+0000: @jonanderson SSHes to the server; observes an unrelated stack trace in the delivery log
* 2015-06-05T05:38:00+0000: @jonanderson re-runs the jobs; same result.
* 2015-06-05T05:38:00+0000: @dmccown verifies SHD and delivery.chef.co are running the same version of pushy
* 2015-06-05T05:45:00+0000: @afiune suggests validating the git repo mount points
* 2015-06-05T05:47:00+0000: @jonanderson starts a group Zoom; delivery support group 3 all present, along with @sf, @oliver, and @cwebber
* 2015-06-05T05:57:00+0000: @cnunciato confirms SHD jobs are working fine
* 2015-06-05T06:14:00+0000: @dmccown opens a PagerDuty incident; pages go out to everyone on group 3
* 2015-06-05T06:26:00+0000: @tom notices clock skew between the delivery.chef.co Chef server and the builders
* 2015-06-05T06:33:00+0000: @cwebber runs ntop on all builders and the Chef server; the job is tested again but still fails
* 2015-06-05T06:35:00+0000: Group debugging and discussion reveals that a recent change to delivery_build has left the push_jobs whitelist regex in a strange state (a \u character in the delivery-cmd arg instead of '\1'). Also, version 0.3.261 of delivery has vendored a version of pushy with debug logging enabled, making it difficult to determine whether jobs were reaching the builders. The team decides this should also be fixed and 0.3.261 pulled.
* 2015-06-05T06:49:00+0000: @jonanderson posts a fix plan: fix the bug in delivery_build around the whitelist; fix delivery to re-vendor the pushy cookbook to 2.4.2; @cwebber is going to pull down that version to fix the debug problem preventing logging on the builders.
* 2015-06-05T06:52:00+0000: @jmorrow pushes a delivery_build CR to fix the whitelist
* 2015-06-05T07:08:00+0000: @schisamo confirms 0.3.261 has been yanked from Package Cloud
* 2015-06-05T07:11:00+0000: @megs confirms she will notify customers of the bad build (GE and Sun Corp)
* 2015-06-05T07:17:00+0000: @cwebber confirms that, after the delivery_build fix and CCRs on delivery.chef.co, all is working
* 2015-06-05T07:17:00+0000: @jonanderson posts the all-clear for the incident.
* 2015-06-05T07:24:00+0000: @jmorrow notifies @afiune that the build is bad, to avoid a customer install at Southwest Airlines.
## Contributing Factor(s)
There were two main contributing factors to this incident:
* The delivery_build change resulted in a push_jobs whitelist on the Chef server that was not able to run delivery-cmd. Since the delivery.chef.co delivery systems were upgraded, the CCR was run on the builders and the result was that jobs could not run (see the sketch after this list).
* Additionally, the vendoring of the 2.4.1 pushy cookbook in delivery build 0.3.261 caused pushy to run with debug logging. This wiped out all the logs on the builders, so the errors could not be seen.
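The sketch below is a minimal, hypothetical model of the first factor, written in Python purely for illustration: the real whitelist is Chef server configuration generated by the delivery_build cookbook and enforced by push jobs, and the pattern, template, and function names here are invented. It only shows why a command template carrying a literal `\u` where a `\1` backreference belongs can never produce the delivery-cmd invocation the builders need.

```python
import re

# Hypothetical push-jobs-style whitelist entries (illustrative only): the command
# template may substitute a capture group from the requested job via '\1'.
GOOD_WHITELIST = {re.compile(r"delivery-cmd (\S+)"): r"delivery-cmd \1"}
# The bad delivery_build change left a literal '\u' where the backreference belonged:
BAD_WHITELIST = {re.compile(r"delivery-cmd (\S+)"): r"delivery-cmd \u"}

def allowed_command(whitelist, requested_job):
    """Return the expanded command if the request is whitelisted, else None."""
    for pattern, template in whitelist.items():
        match = pattern.fullmatch(requested_job)
        if match:
            try:
                return match.expand(template)
            except re.error:
                # Recent Pythons reject the stray '\u' escape outright; either way,
                # the correct delivery-cmd invocation is never produced.
                return None
    return None

print(allowed_command(GOOD_WHITELIST, "delivery-cmd change-1234"))  # delivery-cmd change-1234
print(allowed_command(BAD_WHITELIST, "delivery-cmd change-1234"))   # None -> job cannot run
```

Under these assumptions, a broken template means push jobs can never hand the builders a runnable delivery-cmd, which matches why every job failed until the delivery_build whitelist was fixed and CCRs were run.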
## Stabilization Steps
What specific steps and actions were taken to stabilize the issue? This
does not always entail a "fix", as further actions should be listed under
"Corrective Actions".
## Impact
What was the impact of the incident? This should include the total
duration of the outage, if applicable.
## Corrective Actions
Action items going forward to fix the issue and reduce the chance of the contributing factors recurring.
This **MUST** include owners/teams assigned to these actions to see them through, and have an issue tracked in this repository (or otherwise linked to an external team kanban/issue tracker).