Post-mortem 8/8/2016 - 8/9/2016

Summary

The Give portion of my.newspring.cc failed to load from about 9:30 PM on August 8 until the issue was resolved around 9:30 AM on August 9. During this time, you may have seen only loading indicators on the site. Several breakdowns led to the error itself and to our not being notified of it until the morning of August 9. We have identified these issues and are taking steps to prevent them from happening again.

Technical Details

We currently manage the accounts that can be given to in Rock. These accounts are queried by my.newspring.cc through our GraphQL server, which we call Heighliner. Heighliner fetches the accounts by making a direct connection to the MSSQL database. Specifically, it selects accounts that have a PublicDescription by excluding those whose PublicDescription is NULL. This check was insufficient: accounts are created with a default PublicDescription of NULL, but the value is updated to a blank string ("") when someone views the account in the Rock admin and saves it.
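The gap in that check can be illustrated with a small sketch. Only the `PublicDescription` field name comes from the description above; the helper functions, the `Account` type, and the sample records are hypothetical and are not Heighliner's actual code.

```python
from typing import List, Optional, TypedDict


class Account(TypedDict):
    # Hypothetical, simplified shape of an account row.
    Name: str
    PublicDescription: Optional[str]


def null_only_filter(account: Account) -> bool:
    """Mirrors the original check: only NULL (None) is excluded."""
    return account["PublicDescription"] is not None


def is_publicly_listed(account: Account) -> bool:
    """Stricter check: NULL, empty, and whitespace-only values are excluded."""
    description = account["PublicDescription"]
    return description is not None and description.strip() != ""


# Hypothetical sample data covering the three states an account can be in.
accounts: List[Account] = [
    {"Name": "Described Account", "PublicDescription": "Shown on the Give page."},
    {"Name": "Never Edited", "PublicDescription": None},
    {"Name": "Saved From Admin", "PublicDescription": ""},
]

loose = [a["Name"] for a in accounts if null_only_filter(a)]
strict = [a["Name"] for a in accounts if is_publicly_listed(a)]

print(loose)   # the blank-string account slips through
print(strict)  # only the account with a real description remains
```

In SQL terms, the stricter check would presumably correspond to a condition along the lines of `PublicDescription IS NOT NULL AND PublicDescription <> ''`, though the exact query Heighliner runs is not shown here.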

That is what happened at around 1:45 PM on August 8. The "Clemson For Christmas" account was viewed in Rock and then saved, which updated its PublicDescription to a blank string. The issue didn't manifest until almost 8 hours later because of a caching mechanism we have in place. Once that cache cleared, my.newspring.cc started trying to display the account as one you could give to. Missing assets for that account caused an error on Heighliner, which in turn caused an error on the Give portion of the site. This was no fault of the user, as they didn't make any updates to the PublicDescription when they saved the account. The fault lies in the way we currently query for accounts.
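The delay between the save and the outage is typical of cached reads. The sketch below assumes a simple time-based expiry, which is an assumption: the actual caching mechanism and its expiry policy are not described here. The `TTLCache` class, the one-hour lifetime, and the sample data are all hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple


class TTLCache:
    """Minimal time-based cache: entries are reloaded once they are too old."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float]) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injected so the example is deterministic
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get_or_load(self, key: str, loader: Callable[[], Any]) -> Any:
        now = self._clock()
        cached = self._entries.get(key)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]  # still fresh: the database is not consulted
        value = loader()
        self._entries[key] = (now, value)
        return value


# Simulated clock and database (hypothetical values).
current_time = [0.0]
database: Dict[str, List[str]] = {"accounts": ["Described Account"]}
cache = TTLCache(ttl_seconds=3600, clock=lambda: current_time[0])


def load_accounts() -> List[str]:
    return list(database["accounts"])


before_change = cache.get_or_load("accounts", load_accounts)

# The account becomes eligible in the database...
database["accounts"].append("Clemson For Christmas")

# ...but cached readers do not see it while the entry is still fresh.
current_time[0] = 1800
while_cached = cache.get_or_load("accounts", load_accounts)

# Once the entry expires, the next read picks up the change.
current_time[0] = 3600
after_expiry = cache.get_or_load("accounts", load_accounts)
```

A change made at one point in time only becomes visible when the cached entry is next reloaded, which is why the error appeared hours after the account was saved.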

On August 8, IT was experiencing some difficulties and had to perform some unexpected maintenance. During this time, many of the application health checks we have in place were alerting us every couple of minutes. At the time, my.newspring.cc was live and functioning, although a little slow because of the active maintenance. For that reason, we decided to snooze our health and error checks until August 9. We did this toward the end of the workday on August 8, which is why we were not alerted to the issue until the morning of August 9.

Next Steps

The issue has been resolved for now, but we are taking steps to prevent this from happening in the future. Today, we are updating the way we query for accounts to prevent accidental inclusion of accounts on Heighliner. Over the next week we will update our client-side query system to better prevent individual errors from locking up the entire page. We are also taking a look at how our application health and error alerts are elevated so that we will be able to act on issues faster.
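The idea of keeping one bad record from locking up the whole page can be sketched independently of any client framework. The example below is a language-agnostic illustration written in Python, not the planned client-side change itself; `render_each`, `render_account`, the `image` field, and the sample accounts are hypothetical.

```python
import logging
from typing import Callable, Dict, Iterable, List, Tuple

logger = logging.getLogger("give")

Account = Dict[str, str]


def render_each(
    items: Iterable[Account],
    render: Callable[[Account], str],
) -> Tuple[List[str], List[Tuple[Account, Exception]]]:
    """Render every item, collecting failures instead of aborting the list."""
    rendered: List[str] = []
    failures: List[Tuple[Account, Exception]] = []
    for item in items:
        try:
            rendered.append(render(item))
        except Exception as error:  # isolate the failure to this one item
            logger.warning("Skipping account %r: %r", item.get("name"), error)
            failures.append((item, error))
    return rendered, failures


def render_account(account: Account) -> str:
    # Raises KeyError when the (hypothetical) image asset is missing.
    return f"{account['name']} [{account['image']}]"


sample_accounts: List[Account] = [
    {"name": "Described Account", "image": "described.jpg"},
    {"name": "Clemson For Christmas"},  # no assets, as in the incident
]

rendered, failures = render_each(sample_accounts, render_account)
```

With this shape, the account with missing assets is skipped and recorded for follow-up, while the remaining accounts still render.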

johnthepink/postmortem.md
