@seldo
Created June 12, 2014 07:57
00:10 seldo: all: we are seeing another huge burst of 503s
00:10 seldo: What's up?
00:11 pwohlers: checking
00:14 seldo: All our internal checks are green and I have manually verified no problems hitting our servers
00:17 pwohlers: seldo - looks like we're seeing packet loss out of SJC
00:17 othiym23: pwohlers: fwiw, we (npm) are getting error reports from users in Australia, who are also saying they're having problems with reddit
00:17 othiym23: not implying causation, just a data point
00:18 pwohlers: thanks
00:18 seldo: I am seeing 503s in our logs (from fastly) from every data center: FRA, SJC, LAX, ORD, LCY, AMS, etc.
00:19 pwohlers: where are your origins seldo ?
00:19 seldo: spread across AWS us-west-2 and us-east-1
00:19 pwohlers: ok
00:20 seldo: aaaand suddenly they stopped
00:20 seldo: www and registry are immediately back to full health
00:21 pwohlers: good
00:21 seldo: What just happened?
00:22 dormando: the closer you are to replacing a bad floor panel, the more likely you are to step through it
00:22 pwohlers: so much for watching tv tonight
00:23 seldo: pithy aphorisms aside, can you tell me what actually happened just then?
00:25 pwohlers: sjc was drained and reverted to back in service.
00:25 seldo: And whether I should expect it to happen again?
00:25 pwohlers: better not.
00:26 seldo: So when SJC was put back into service it fell over again?
00:26 dormando: I just had someone disable it harder.
00:26 dormando: it can't come back up without people typing passwords.
00:26 seldo: So it came back into service automatically after a while?
00:27 dormando: Yeah. a transient downtime flag was used, it doesn't normally disappear on its own though.
00:27 dormando: we're not sure why it disappeared yet.
00:27 dormando: SJC is a very old build.
00:28 seldo: Why did SJC going down take out what appeared to be worldwide traffic?
00:29 dormando: SJC serves part of asia
00:29 dormando: as does LAX
00:30 seldo: FRA, SYD, JFK, AMS, LCY, DAL, TYO, IAD, DFW, SYD all appear in our 503s
00:30 seldo: What's the connection between them and SJC?
00:31 dormando: you're not shielding anything through SJC are you?
00:31 dormando: I can't remember if you'd split it
00:31 dormando: I moved some people out of SJC into LAX last week
00:31 dormando: and we tried to move most everyone out of ASH and into IAD
00:33 seldo: We were entirely in Ashburn
00:33 dormando: seldo: it'd have to be shielding. I can't think of any other reason
00:33 dormando: you don't have some backends in sjc, some in ash?
00:33 seldo: When we moved out of Ashburn we moved into a mixture of LAX, SJC and IAD
00:33 dormando: ok, so the SJC ones probably failed and didn't retry other backends
00:34 seldo: So we should have been faster at implementing the trick of adding multiple datacenters to our shielding
00:35 dormando: well we should stop sucking, for one thing.
00:35 dormando: if you can avoid SJC that's probably for the best
00:35 dormando: I'd been pulling people out of it
00:35 seldo: If SJC is out right now, what's happening to our hosts that are currently configured to shield from SJC?
00:35 dormando: with shielding down, unless you have something set to retry a different backend, it'll not do origin shield
00:36 dormando: they'll work, but they'll be doing intra-pop shielding only
00:36 seldo: Okay.
00:39 dormando: I'm not really sure what to say. this is unacceptable.
00:43 jmdade has joined ([email protected])
00:54 seldo: So what do we think: can I sleep?