Created June 4, 2014 14:40
07:04 seldo: For our outage report, to clarify: the whole DC with the shield cache in it went down?
07:04 seldo: You said you lost "v37" but I don't know if that's a box or a cluster or a whole DC
07:05 Mithrandir: it's a machine.
07:05 Mithrandir: (a single machine)
07:05 seldo: So is that machine a SPoF was there a secondary cause of the shield failure?
07:05 seldo: *or
07:07 Mithrandir: we turn off shielding on varnish restarts, since the shielding doesn't help until there's content in the cache. We've been rolling out fixes for some bugs recently (and hitting some bugs), so we ended up with all machines in ASH having shielding turned off (since they had recently been restarted).
07:07 Mithrandir: cache-v37 was the final machine with shielding turned on in ASH
07:08 Mithrandir: so what we're doing to avoid this happening again is we'll be forcing shielding on early if we get below threshold.
07:09 seldo: Ahhh, okay, that explanation makes sense
07:09 seldo: I was definitely confused how a single box loss could do this
07:09 seldo: But I totally see how that could happen now
07:09 Mithrandir: yes, it was just the straw that broke the proverbial camel's back.
07:10 Mithrandir: even with that fix, you'll still be vulnerable to us taking a whole dc offline, so we want to come up with a fix for that.
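
The "force shielding on early if we get below threshold" fix described at 07:08 could look roughly like the sketch below. This is only an illustration of the idea, not the provider's actual implementation: the `CacheNode` fields, the `nodes_to_force_shield` function, the `min_shielding` parameter, and the machine names other than cache-v37 are all hypothetical, and the choice to prefer the machines that restarted longest ago (on the guess that their caches are warmest) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class CacheNode:
    # All fields are hypothetical; the chat log only tells us that machines
    # have shielding turned off after a restart.
    name: str
    healthy: bool
    shielding: bool
    seconds_since_restart: float


def nodes_to_force_shield(nodes, min_shielding):
    """Return names of nodes whose shielding should be forced on early."""
    # Count the machines that are both up and currently shielding.
    active = [n for n in nodes if n.healthy and n.shielding]
    shortfall = min_shielding - len(active)
    if shortfall <= 0:
        return []
    # Candidates are healthy machines still waiting for their cache to warm.
    candidates = [n for n in nodes if n.healthy and not n.shielding]
    # Assumption: the machine restarted longest ago has the warmest cache.
    candidates.sort(key=lambda n: n.seconds_since_restart, reverse=True)
    return [n.name for n in candidates[:shortfall]]


# Roughly the situation in the log: every other machine in the DC was
# recently restarted (shielding off), then the last shielding machine is lost.
ash = [
    CacheNode("cache-v35", healthy=True, shielding=False, seconds_since_restart=600),
    CacheNode("cache-v36", healthy=True, shielding=False, seconds_since_restart=1800),
    CacheNode("cache-v37", healthy=False, shielding=True, seconds_since_restart=90000),
]
print(nodes_to_force_shield(ash, min_shielding=1))  # prints ['cache-v36']
```

As the 07:10 message points out, a check like this only helps while some healthy machine remains in the DC; if the whole DC goes offline the candidate list is empty and nothing can be forced on.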