@seldo
Created June 4, 2014 14:40
07:04 seldo: For our outage report, to clarify: the whole DC with the shield cache in it went down?
07:04 seldo: You said you lost "v37" but I don't know if that's a box or a cluster or a whole DC
07:05 Mithrandir: it's a machine.
07:05 Mithrandir: (a single machine)
07:05 seldo: So is that machine a SPoF, or was there a secondary cause of the shield failure?
07:07 Mithrandir: we turn off shielding on varnish restarts, since the shielding doesn't help until there's content in the cache. We've been rolling out fixes for some bugs recently (and hitting some bugs), so we ended up with all machines in ASH having shielding turned off (since they had recently been restarted).
07:07 Mithrandir: cache-v37 was the final machine with shielding turned on in ASH
07:08 Mithrandir: so what we're doing to avoid this happening again is we'll be forcing shielding on early if we get below threshold.
07:09 seldo: Ahhh, okay, that explanation makes sense
07:09 seldo: I was definitely confused how a single box loss could do this
07:09 seldo: But I totally see how that could happen now
07:09 Mithrandir: yes, it was just the straw that broke the proverbial camel's back.
07:10 Mithrandir: even with that fix, you'll still be vulnerable to us taking a whole dc offline, so we want to come up with a fix for that.
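
The fix described at 07:08 (forcing shielding back on early once too few machines are shielding) can be sketched roughly as below. This is only an illustration of the idea, not the provider's actual implementation: the `CacheNode` type, the `enforce_min_shielding` function, the node names other than cache-v37, and the threshold value of 2 are all hypothetical assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class CacheNode:
    """Hypothetical model of one cache machine in a datacenter."""
    name: str
    shielding: bool       # False after a restart, while the cache is still cold
    healthy: bool = True  # False once the machine is lost


def shielding_count(nodes):
    """Count healthy nodes that currently have shielding turned on."""
    return sum(1 for n in nodes if n.healthy and n.shielding)


def enforce_min_shielding(nodes, min_shielding):
    """Force shielding on for healthy nodes until the threshold is met.

    Returns the names of the nodes that were forced on. The threshold is
    an assumed parameter; the log only says "below threshold".
    """
    forced = []
    for node in nodes:
        if shielding_count(nodes) >= min_shielding:
            break
        if node.healthy and not node.shielding:
            node.shielding = True  # accept a cold cache rather than no shield
            forced.append(node.name)
    return forced


# Scenario from the log: every machine except cache-v37 was recently
# restarted (shielding off), then cache-v37 is lost.
# "cache-a" and "cache-b" are made-up names for the restarted machines.
ash = [
    CacheNode("cache-a", shielding=False),
    CacheNode("cache-b", shielding=False),
    CacheNode("cache-v37", shielding=True),
]
ash[2].healthy = False

forced = enforce_min_shielding(ash, min_shielding=2)
print(forced, shielding_count(ash))
```

As noted at 07:10, a check like this only protects against running out of shielding machines inside one DC; it does nothing if the whole DC goes offline.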