Downtime Post Mortem
  • Downtime window: 4:41 PM EST - 4:55 PM EST
  • Total downtime: 14 minutes

"Bob" (backend engineer) was the developer on call during this incident. All names have been changed to protect the innocent. :)

Summary

Bad code which prevented server startup was deployed to production. It had not been manually tested in a dev environment, but had passed all unit and integration tests.

During the same deploy, a database migration removed a column from a table, making the old code incompatible. When the code was rolled back, the old column had to be added back manually because the Rails migration was irreversible.

The incident response was delayed because Bob was out shopping and nobody was available as a backup. He should have stayed to monitor the deploy, but got impatient when it took too long.

Timeline

Bad code not caught during testing

This pull request (redacted) contains the bad code and the migration that caused the downtime. Since it was a backend change without any frontend component (yet), there was no way to test the change manually, so unit and integration tests were used to verify parts of the functionality.

Some of the functionality was not in use yet, so it was considered safe to delay writing tests for it. As a result, the bad code was not covered by any automated tests. Furthermore, since eager loading was turned off in test mode, the bad code was never run at all.
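
For context, this is the Rails setting involved. A minimal sketch of a typical config/environments/test.rb (the project's actual configuration may differ):

# config/environments/test.rb
Rails.application.configure do
  # With eager loading disabled, a class is only loaded when something references it,
  # so a class body that raises at load time can slip through an entire test run.
  config.eager_load = false
end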

The code was never deployed to a Heroku dev environment because there was no need for manual testing. When the pull request was merged, the deploy to staging succeeded, but the staging app server crashed on startup. This went unnoticed until later.

A deploy that took too long

Total deployment time is normally 20 to 25 minutes: 8 minutes to run tests, 6 minutes for the actual deployment, and the rest is Circle CI doing setup and housekeeping.

On this occasion, Bob was babysitting the deploy as usual and getting impatient. He noticed that a step called "Releasing parallel containers that are no longer needed" was taking minutes, when it normally takes seconds. He gave up waiting and went grocery shopping.

An incompatible migration

The same pull request added a database migration which removed the topic_id column from the topic_follows table and replaced it with a tag column. This was not a "zero downtime" migration: the code already running on the app servers was not compatible with the change. That wasn't realized until after the code had been deployed to production.
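
A minimal sketch of what such a migration typically looks like; the class name and column types here are assumed, since the real migration lives in the redacted pull request:

class ReplaceTopicIdWithTag < ActiveRecord::Migration[5.1]
  def change
    # Dropping a column without specifying its type cannot be automatically
    # reversed, which is what made this migration irreversible on rollback.
    remove_column :topic_follows, :topic_id
    add_column :topic_follows, :tag, :string
  end
end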

Initially, Airbrake reported 6 instances of the following error:

PG::UndefinedColumn: ERROR: column topic_follows.topic_id does not exist

This alerted Bob to the fact that something was wrong. Two minutes later, he received a Pingdom alert that the entire site was down. At first he thought the site was down because of this error, and that somehow the newly pushed code was still using the old column somewhere, so he believed a simple code rollback would bring the site back up.

In fact, as he learned later, the error came from the existing code breaking once the column was gone, before the app had restarted with the new code. Such problems are difficult to predict and reason about, which is why the best practice is not to remove a column immediately, but to first deploy code that no longer uses the column and only drop it afterwards.
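
A minimal sketch of that two-step approach, assuming Rails 5+ and the table/column names used in this post:

# Deploy 1: the model stops reading and writing the column, so code that is
# still running doesn't break when the column later disappears.
class TopicFollow < ApplicationRecord
  self.ignored_columns = ["topic_id"]
end

# Deploy 2, after the code above is live everywhere: actually drop the column.
class RemoveTopicIdFromTopicFollows < ActiveRecord::Migration[5.1]
  def change
    remove_column :topic_follows, :topic_id, :integer
  end
end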

No one else available to help

Unfortunately, Bob was at Whole Foods.

Bob tossed the groceries and headed straight for his car. He knew it would take at least 10 minutes to get home and roll back the deploy. He checked Slack to see if anyone else could help. "Mary" (backend developer) was on vacation and "Mark" (frontend engineer based in Poland) was offline. "Matt" (product manager) was also offline but he would probably respond to a phone call, so Bob called him. He responded, but was unable to help in time because he didn't have the Heroku Toolbelt set up. By the time he had it set up, Bob was already back home.

An irreversible migration

When Bob got back, he rolled back the deploy by running:

heroku rollback -a prod

To be safe, he also added back the topic_id column. He already knew the migration was irreversible because he had seen the error come up while switching branches. Reversible migrations are good in some cases, but are often unrealistic to run in practice, so he didn't feel it was necessary to enforce a practice of writing reversible migrations.
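
For reference, a sketch of how the same change could have been written reversibly with explicit up/down methods (class name and column types assumed, as above):

class ReplaceTopicIdWithTag < ActiveRecord::Migration[5.1]
  def up
    remove_column :topic_follows, :topic_id
    add_column :topic_follows, :tag, :string
  end

  def down
    # Restores the old schema so rake db:rollback (or switching branches) works.
    add_column :topic_follows, :topic_id, :integer
    remove_column :topic_follows, :tag
  end
end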

He ran heroku pg:psql -a prod, googled the syntax for adding a column since he almost never does it manually, and entered:

alter table topic_follows add column topic_id int;

Three minutes after the rollback, the new app servers were up, and Pingdom reported that the site was back up.

Aftermath

After this incident, we were initially confused about what exactly had gone wrong. At first it seemed like the database migration had caused the downtime, but this was a red herring.

During his investigation, Bob discovered that he had never actually deployed his code to any dev environment. (He thought he had deployed it to dev1.) He checked staging and discovered that it too had a 500 error on the homepage. Checking the logs, he saw the following error message:

NoMethodError: private method `new' called for Procto:Class
  airwolf/app/services/is_following_topic.rb:3:in `<class:IsFollowingTopic>'

This led him to fix the bad code, and to investigate why this bad code wasn't caught during testing.

After fixing the issue and deploying to staging, he verified that the problem was fixed, and deployed the change to production.

Later, he ran the following SQL to remove the unneeded column:

alter table topic_follows drop column topic_id;

Lessons Learned

  • A Heroku dev environment needs to be the canary in the coal mine. Normally this is staging, but there was no monitoring on staging to verify that a deploy resulted in a functional app server. This is now in place via a Pingdom alert.
  • Eager loading has been turned on in test mode so that bad code will cause immediate failure. This should allow this kind of problem to be caught much sooner, either on the developer's machine or on Circle CI while the feature is in progress.
  • The deploy process will be sped up. Specifically, Mark is working on speeding up asset compilation. Bob will review the deploy script, remove any unnecessary steps, and contact Circle CI support about any slow steps that aren't under our control.
  • A developer will always be available in a 20-minute time window around a deploy, especially if that developer was the one who kicked off the deploy!
  • Matt will be available as a backup to run any commands needed in case a developer isn't able to get to a computer.
  • With any database migration, it is always good to have a rollback plan in mind. The riskier the change, the more fleshed out the rollback plan should be.
  • Removing a column is a risky change and should be treated as such. We have already identified tools that can automatically detect database migrations that are unsafe or require downtime, and they just need to be implemented. These tools include https://github.com/gocardless/activerecord-safer_migrations and https://github.com/ankane/strong_migrations; a sketch of how such a tool intervenes follows this list.
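
For illustration, a minimal sketch of what strong_migrations does when it encounters an unsafe column removal, based on the behavior described in that project's README (the migration name here is hypothetical):

class RemoveTopicIdFromTopicFollows < ActiveRecord::Migration[5.1]
  def change
    # With the strong_migrations gem installed, running this migration aborts
    # with an error explaining the safe, multi-deploy way to drop a column:
    # ignore the column in the model first, then remove it in a later deploy.
    remove_column :topic_follows, :topic_id
  end
end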