@pbjorklund
Created September 24, 2012 17:47
Hi,
Today, the Pingdom team deployed a software upgrade to some of our
monitoring probes. Despite thorough testing, this upgrade contained a
malfunction that led to false down alerts being sent to a portion of our
customers, including you.
Even though the issue affected monitoring for less than 90 minutes for a
limited number of customers, it's of course frustrating if you were one
of them. We take a lot of pride in delivering a reliable service and
this doesn't represent what Pingdom stands for.
Let us first stress how rare it is that something like this happens at
Pingdom. In fact, this is the first time a similar occurrence has struck
us. That said, we want to take this opportunity to provide information
about what happened, present the actions we've already taken, and tell
you how we will move forward.
Our normal deployment of new and updated software consists of a series
of tests designed to make sure that our systems are reliable. This
means that we roll out updates gradually to our infrastructure and only
after they've been thoroughly tested in our development and staging
environments.
Today at around 8 am GMT we gradually started to roll out the update to
a few selected monitoring probes. Immediately we saw that there was an
issue with the code and did a rollback. But, unfortunately, a limited
number of customers had faulty downtimes recorded in their data and in
some cases also received faulty down alerts during a limited time.
After a thorough investigation we've already initiated actions to
minimize the effect this may have had, including:
x Affected Pingdom checks will have their up and down records marked as
unmonitored for the period in question, up to a maximum of 90 minutes.
Therefore, each site's uptime record will not be affected. In other
words, your uptime percentage will not change due to this incident.
x Any SMS credits lost due to incorrect alerts in connection with this
issue have been refunded. You will receive double the number of credits
that were used during the incident.
x We will take further steps to make sure that future upgrades to our
infrastructure will be implemented with even more caution. This
incident has already led to improvements in our deployment routines.
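To illustrate why marking a period as unmonitored leaves the uptime
percentage unchanged, here is a minimal sketch in Python. It assumes
uptime is calculated as up time divided by monitored (up plus down) time,
with unmonitored time left out of both. The function name and the figures
are hypothetical and are not taken from Pingdom's actual implementation.

```python
def uptime_percentage(up_minutes, down_minutes):
    """Uptime as a share of monitored time; unmonitored minutes are excluded."""
    monitored = up_minutes + down_minutes
    if monitored == 0:
        raise ValueError("no monitored time in the period")
    return 100.0 * up_minutes / monitored


# Hypothetical day (1440 minutes) in which the site was really up throughout.
before_incident = uptime_percentage(up_minutes=1440, down_minutes=0)

# If 90 minutes of false downtime stayed on record, the percentage would drop.
with_false_downtime = uptime_percentage(up_minutes=1350, down_minutes=90)

# Marking those 90 minutes as unmonitored removes them from both totals,
# so the percentage matches the value from before the incident.
after_correction = uptime_percentage(up_minutes=1350, down_minutes=0)
```

Under these assumptions the corrected figure equals the pre-incident one,
because the affected minutes count neither as up nor as down.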
We want you to rest assured that all of us working at Pingdom take
significant pride in delivering the best possible service, and even
though mistakes happen they are not acceptable to us.
We're really sorry that you were affected by this. You can be sure that
someone will be wearing the stupid hat today.
Please contact us at [email protected] if you have any questions or
comments.
The Pingdom Team
www.pingdom.com