# INCIDENT DATE - INCIDENT TYPE
## Meeting
TBD
#### Start every PM by stating the following
1. This is a blameless post-mortem.
2. We will not focus on past events in terms of "could've", "should've", etc.
3. All follow-up action items will be assigned to a team/individual before the end of the meeting. If an item is not going to be a top priority leaving the meeting, don't make it a follow-up item.
### Incident Leader: Jon Anderson
## Description
It was discovered that delivery jobs from delivery.chef.co were failing on all builders, blocking dependent work.
## Timeline
This incident began at 4:58 AM UTC on Wednesday, September 8, 2015. It was resolved at 7:17 AM UTC the same day.
* Time to detect: 97 minutes (4:58 AM - 6:35 AM UTC)
* Time to resolve: 42 minutes (6:35 AM - 7:17 AM UTC)
* 2015-06-05T04:58:00+0000: @cwebber reports jobs failing on delivery.shd.chef.co
* 2015-06-05T05:15:00+0000: @jonanderson declares an incident in Slack #delivery-support and moves it to #delivery-incident; he is IC
* 2015-06-05T05:26:00+0000: @cwebber relates details of the server upgrade process, which involved changing the underlying mount points of the delivery FS via soft links. Previous disk space issues are also reported
* 2015-06-05T05:36:00+0000: @jonanderson SSHes to the server; observes an unrelated stack trace in the delivery log
* 2015-06-05T05:38:00+0000: @jonanderson re-runs the jobs; same result.
* 2015-06-05T05:38:00+0000: @dmccown verifies SHD and delivery.chef.co are running the same version of pushy
* 2015-06-05T05:45:00+0000: @afiune suggests validating the git repo mount points
* 2015-06-05T05:47:00+0000: @jonanderson starts a group Zoom; delivery support group 3 all present, along with @sf, @oliver, and @cwebber
* 2015-06-05T05:57:00+0000: @cnunciato confirms SHD jobs are working fine
* 2015-06-05T06:14:00+0000: @dmccown opens a PagerDuty incident; pages go out to everyone on group 3
* 2015-06-05T06:26:00+0000: @tom notices clock skew between the delivery.chef.co Chef server and the builders
* 2015-06-05T06:33:00+0000: @cwebber runs ntop on all builders and the Chef server; the job is tested again but still fails
* 2015-06-05T06:35:00+0000: Group debugging and discussion reveals that a recent change to delivery_build has left the push_jobs whitelist regex in a strange state (a \u character in the delivery-cmd arg instead of '\1'). Also, version 0.3.261 of delivery has vendored a version of pushy with debug logging enabled, making it difficult to determine whether jobs were reaching the builders. The team decides this should also be fixed and 0.3.261 pulled.
* 2015-06-05T06:49:00+0000: @jonanderson posts a fix plan: fix the bug in delivery_build around the whitelist; fix delivery to re-vendor the pushy cookbook to 2.4.2; @cwebber is going to pull down that version to fix the debug problem preventing logging on the builders.
* 2015-06-05T06:52:00+0000: @jmorrow pushes a delivery_build CR to fix the whitelist
* 2015-06-05T07:08:00+0000: @schisamo confirms 0.3.261 has been yanked from Package Cloud
* 2015-06-05T07:11:00+0000: @megs confirms she will notify customers of the bad build (GE and Sun Corp)
* 2015-06-05T07:17:00+0000: @cwebber confirms that, after the delivery_build fix and CCRs on delivery.chef.co, all is working
* 2015-06-05T07:17:00+0000: @jonanderson posts the all-clear for the incident.
* 2015-06-05T07:24:00+0000: @jmorrow notifies @afiune that the build is bad, to avoid a customer install at Southwest Airlines.
## Contributing Factor(s)
There were two main contributing factors to this incident:
* The delivery_build change resulted in a push_jobs whitelist on the Chef server that was not able to run delivery-cmd. Since the delivery.chef.co delivery systems were upgraded, the CCR was run on the builders and the result was that jobs could not run (see the sketch after this list).
* Additionally, the vendoring of the 2.4.1 pushy cookbook in delivery build 0.3.261 caused pushy to run with debug logging. This wiped out all the logs on the builders, so the errors could not be seen.
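The sketch below is a minimal, hypothetical model of the first factor, written in Python purely for illustration: the real whitelist is Chef server configuration generated by the delivery_build cookbook and enforced by push jobs, and the pattern, template, and function names here are invented. It only shows why a command template carrying a literal `\u` where a `\1` backreference belongs can never produce the delivery-cmd invocation the builders need.

```python
import re

# Hypothetical push-jobs-style whitelist entries (illustrative only): the command
# template may substitute a capture group from the requested job via '\1'.
GOOD_WHITELIST = {re.compile(r"delivery-cmd (\S+)"): r"delivery-cmd \1"}
# The bad delivery_build change left a literal '\u' where the backreference belonged:
BAD_WHITELIST = {re.compile(r"delivery-cmd (\S+)"): r"delivery-cmd \u"}

def allowed_command(whitelist, requested_job):
    """Return the expanded command if the request is whitelisted, else None."""
    for pattern, template in whitelist.items():
        match = pattern.fullmatch(requested_job)
        if match:
            try:
                return match.expand(template)
            except re.error:
                # Recent Pythons reject the stray '\u' escape outright; either way,
                # the correct delivery-cmd invocation is never produced.
                return None
    return None

print(allowed_command(GOOD_WHITELIST, "delivery-cmd change-1234"))  # delivery-cmd change-1234
print(allowed_command(BAD_WHITELIST, "delivery-cmd change-1234"))   # None -> job cannot run
```

Under these assumptions, a broken template means push jobs can never hand the builders a runnable delivery-cmd, which matches why every job failed until the delivery_build whitelist was fixed and CCRs were run.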
## Stabilization Steps
What specific steps and actions were taken to stabilize the issue? This
does not always entail a "fix", as further actions should be listed under
"Corrective Actions".
## Impact
What was the impact of the incident? This should include the total
duration of the outage, if applicable.
## Corrective Actions
Action items going forward to fix the issue and reduce the chance of the contributing factors recurring.
This **MUST** include owners/teams assigned to these actions to see them through, and have an issue tracked in this repository (or otherwise linked to an external team kanban/issue tracker).