Created
September 24, 2012 17:47
| Hi, | |
| Today, the Pingdom team deployed a software upgrade to some of our | |
| monitoring probes. Despite thorough testing, this upgrade contained a | |
| malfunction that led to false down alerts being sent to a portion of our | |
| customers, including you. | |
| Even though the issue affected monitoring for less than 90 minutes for a | |
| limited number of customers, it's of course frustrating if you were one | |
| of them. We take a lot of pride in delivering a reliable service and | |
| this doesn't represent what Pingdom stands for. | |
| Let us first stress how rare it is that something like this happens at | |
| Pingdom. In fact, this is the first time a similar occurrence has struck | |
| us. That said, we want to take this opportunity to provide information | |
| about what happened, present what actions we've already taken, as well | |
| as tell you how we will move forward. | |
| Our normal deployment of new and updated software consists of a series | |
| of tests designed to make sure that our systems are reliable. This | |
| means that we roll out updates gradually to our infrastructure and only | |
| after they've been thoroughly tested in our development and staging | |
| environments. | |
| Today at around 8 am GMT we gradually started to roll out the update to | |
| a few selected monitoring probes. Immediately we saw that there was an | |
| issue with the code and did a rollback. But, unfortunately, a limited | |
| number of customers had faulty downtimes recorded in their data and in | |
| some cases also received faulty down alerts during a limited time. | |
| After a thorough investigation we've already initiated actions to | |
| minimize the effect this may have had, including: | |
| x Affected Pingdom checks will have their up and down records marked as | |
| unmonitored for the period in question, up to a maximum of 90 minutes. | |
| Therefore, each site's uptime record will not be affected. In other | |
| words, your uptime percentage will not change due to this incident. | |
| x Any lost SMS credits due to incorrect alerts in connection with this | |
| issue have been refunded. You will receive double the amount of credits | |
| that was used during the incident. | |
| x We will take further steps to make sure that future upgrades to our | |
| infrastructure will be implemented with even more caution. This | |
| incident has already led to improvements in our deployment routines. | |
| We want you to rest assured that all of us working at Pingdom take | |
| significant pride in delivering the best possible service, and even | |
| though mistakes happen they are not acceptable to us. | |
| We're really sorry that you were affected by this. You can be sure that | |
| someone will be wearing the stupid hat today. | |
| Please contact us at [email protected] if you have any questions or | |
| comments. | |
| The Pingdom Team | |
| www.pingdom.com |