@rogerhub
Last active March 4, 2016 09:01
RogerHub Monday 12/7/2015 Outage

The Final Grade Calculator was broken from approximately 2:21AM PST to 11:22AM PST (9 hours 1 minute) on December 7th, 2015.

Root cause

The JavaScript for the Final Grade Calculator was out of date on the standby server, which began serving the live site after a failover.

Problem 1: At around 2AM PST, I began seeing increased ping delays and dropped packets to the Dallas, TX Linode datacenter where RogerHub.com is hosted. At 2:21AM PST, I decided to invoke the failover mechanism and transfer the live site to a standby server running in the Fremont, CA Linode datacenter. HTTP traffic from the Dallas, TX server was routed to the Fremont, CA server transparently, while the DNS records were updated on Route53.

I verified that the site worked and the administration backend was consistent, and after the Dallas, TX server became available again, I set up the Dallas, TX server in standby mode, so another failover could be performed if needed.

Problem 2: When I made changes to the Final Grade Calculator a few weeks ago, I updated the HTML within WordPress and updated the JavaScript that is hosted statically. The HTML propagates to the standby server via MySQL replication, while the JavaScript needs to be deployed manually. I mistakenly thought that JavaScript would be synced automatically by the same mechanism that syncs file system updates to the standby server, so the JavaScript on the standby server was never updated.
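One way to catch this kind of drift is to compare content digests of the static files on both servers. The following is a minimal sketch, not the site's actual tooling: it assumes both file trees are reachable as local directories (for example, after copying the standby's tree over), and the names `digest_tree` and `find_drift` are hypothetical.

```python
import hashlib
from pathlib import Path


def digest_tree(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def find_drift(master: dict[str, str], standby: dict[str, str]) -> dict[str, str]:
    """Return {relative_path: reason} for every file that differs between trees."""
    drift = {}
    for rel, digest in master.items():
        if rel not in standby:
            drift[rel] = "missing on standby"
        elif standby[rel] != digest:
            drift[rel] = "content differs"
    # Files that exist only on the standby are also worth reporting.
    for rel in standby.keys() - master.keys():
        drift[rel] = "missing on master"
    return drift
```

An empty result from `find_drift(digest_tree(master_dir), digest_tree(standby_dir))` would mean the two trees hold identical files; any entry points at a file that was not deployed to both servers.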

Problem 3: While I had monitoring for the Final Grade Calculator's HTML, I did not have any monitoring that would actually try to use the tool or to check for JavaScript errors that were unrelated to advertising.

Problem 4: I did not try to use the Final Grade Calculator after the failover operation, or else I would have noticed it did not work.
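A post-failover smoke check could cover part of the gap described in Problems 3 and 4. The sketch below is an illustration under assumptions, not the monitoring that was in place: it extracts every `<script src>` from a page, fetches each one through a caller-supplied `fetch` function (hypothetical, returning a status code and body bytes), and compares the body against an expected SHA-256 digest. It does not execute any JavaScript, so it would flag stale or missing script files but would not replace a check that actually uses the calculator.

```python
import hashlib
from html.parser import HTMLParser
from urllib.parse import urljoin


class ScriptSrcParser(HTMLParser):
    """Collect the src attribute of every <script> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.sources: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:  # inline scripts have no src and are skipped
                self.sources.append(src)


def check_scripts(page_url, html, fetch, expected):
    """Return a list of problems found with the page's external scripts.

    fetch(url) must return (status_code, body_bytes).
    expected maps absolute script URLs to their SHA-256 hex digests.
    """
    parser = ScriptSrcParser()
    parser.feed(html)
    problems = []
    for src in parser.sources:
        url = urljoin(page_url, src)  # resolve relative paths against the page
        status, body = fetch(url)
        if status != 200:
            problems.append(f"{url}: HTTP {status}")
        elif url in expected and hashlib.sha256(body).hexdigest() != expected[url]:
            problems.append(f"{url}: content does not match expected digest")
    return problems
```

Passing `fetch` in as a parameter keeps the check independent of any particular HTTP client and lets it run against canned responses in tests.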

Resolution

I manually deployed the current JavaScript to the Fremont, CA server (the former standby, live since the failover).

Next steps

It was good to have tested the failover mechanism during a low-traffic period (early morning in the United States). However, several further steps would make failovers safer:

  • Add monitoring for network latency or dropped packets to RogerHub.com.
  • Add monitoring that actually tries to use the Final Grade Calculator.
  • Add monitoring for 404s and other HTTP errors on RogerHub.
  • Make sure to always deploy HTML and JavaScript to both the master and the standby server.
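
As an illustration of the first item, the sketch below summarizes a batch of probe results into a packet-loss ratio and a median round-trip time, then decides whether to alert. It assumes the probes are collected elsewhere (for example, by a ping job) and passed in as a list of round-trip times in milliseconds, with `None` marking a dropped packet. The function names and the thresholds are hypothetical placeholders, not values taken from this incident.

```python
from statistics import median
from typing import Optional, Sequence


def summarize_probes(rtts_ms: Sequence[Optional[float]]) -> dict:
    """Summarize probe results; None marks a dropped packet."""
    if not rtts_ms:
        raise ValueError("no probe results")
    answered = [r for r in rtts_ms if r is not None]
    return {
        "loss": 1 - len(answered) / len(rtts_ms),
        "median_ms": median(answered) if answered else None,
    }


def should_alert(summary: dict, max_loss: float = 0.05,
                 max_median_ms: float = 250.0) -> bool:
    """Alert on high packet loss, high latency, or no answers at all."""
    if summary["loss"] > max_loss:
        return True
    median_ms = summary["median_ms"]
    return median_ms is None or median_ms > max_median_ms
```

The median is used instead of the mean so that a single slow probe does not trigger an alert on its own.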