Created
June 12, 2014 07:57
00:10 seldo: all: we are seeing another huge burst of 503s
00:10 seldo: What's up?
00:11 pwohlers: checking
00:14 seldo: All our internal checks are green and I have manually verified no problems hitting our servers
00:17 pwohlers: seldo - looks like we're seeing packet loss out of SJC
00:17 othiym23: pwohlers: fwiw, we (npm) are getting error reports from users in Australia, who are also saying they're having problems with reddit
00:17 othiym23: not implying causation, just a data point
00:18 pwohlers: thanks
00:18 seldo: I am seeing 503s in our logs (from fastly) from every data center: FRA, SJC, LAX, ORD, LCY, AMS, etc.
00:19 pwohlers: where are your origins seldo ?
00:19 seldo: spread across AWS us-west-2 and us-east-1
00:19 pwohlers: ok
00:20 seldo: aaaand suddenly they stopped
00:20 seldo: www and registry are immediately back to full health
00:21 pwohlers: good
00:21 seldo: What just happened?
00:22 dormando: the closer you are to replacing a bad floor panel, the more likely you are to step through it
00:22 pwohlers: so much for watching tv tonight
00:23 seldo: pithy aphorisms aside, can you tell me what actually happened just then?
00:25 pwohlers: sjc was drained and reverted to back in service.
00:25 seldo: And whether I should expect it to happen again?
00:25 pwohlers: better not.
00:26 seldo: So when SJC was put back into service it fell over again?
00:26 dormando: I just had someone disable it harder.
00:26 dormando: it can't come back up without people typing passwords.
00:26 seldo: So it came back into service automatically after a while?
00:27 dormando: Yeah. a transient downtime flag was used, it doesn't normally disappear on its own though.
00:27 dormando: we're not sure why it disappeared yet.
00:27 dormando: SJC is a very old build.
00:28 seldo: Why did SJC going down take out what appeared to be worldwide traffic?
00:29 dormando: SJC serves part of asia
00:29 dormando: as does LAX
00:30 seldo: FRA, SYD, JFK, AMS, LCY, DAL, TYO, IAD, DFW, SYD all appear in our 503s
00:30 seldo: What's the connection between them and SJC?
00:31 dormando: you're not shielding anything through SJC are you?
00:31 dormando: I can't remember if you'd split it
00:31 dormando: I moved some people out of SJC into LAX last week
00:31 dormando: and we tried to move most everyone out of ASH and into IAD
00:33 seldo: We were entirely in Ashburn
00:33 dormando: seldo: it'd have to be shielding. I can't think of any other reason
00:33 dormando: you don't have some backends in sjc, some in ash?
00:33 seldo: When we moved out of Ashburn we moved into a mixture of LAX, SJC and IAD
00:33 dormando: ok, so the SJC ones probably failed and didn't retry other backends
00:34 seldo: So we should have been faster at implementing the trick of adding multiple datacenters to our shielding
00:35 dormando: well we should stop sucking, for one thing.
00:35 dormando: if you can avoid SJC that's probably for the best
00:35 dormando: I'd been pulling people out of it
00:35 seldo: If SJC is out right now, what's happening to our hosts that are currently configured to shield from SJC?
00:35 dormando: with shielding down, unless you have something set to retry a different backend, it'll not do origin shield
00:36 dormando: they'll work, but they'll be doing intra-pop shielding only
00:36 seldo: Okay.
00:39 dormando: I'm not really sure what to say. this is unacceptable.
00:43 jmdade has joined ([email protected])
00:54 seldo: So what do we think: can I sleep?
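
Note on the failure mode discussed at 00:33 to 00:36: requests routed through a single shield POP returned 503s because nothing was set to retry a different backend. The sketch below illustrates the idea of listing several shield POPs and falling back to a direct origin fetch. It is a plain Python illustration only, not Fastly configuration or any Fastly API; `ShieldUnavailable`, `fetch_with_failover`, and the two callables passed in are hypothetical names invented for this example.

```python
from typing import Callable, Sequence


class ShieldUnavailable(Exception):
    """Hypothetical error: the chosen shield POP could not serve the request."""


def fetch_with_failover(
    path: str,
    shields: Sequence[str],
    fetch_via_shield: Callable[[str, str], str],
    fetch_from_origin: Callable[[str], str],
) -> str:
    """Try each shield POP in order; go straight to origin if all are down."""
    for pop in shields:
        try:
            # First healthy shield wins.
            return fetch_via_shield(pop, path)
        except ShieldUnavailable:
            # This POP is down (e.g. SJC in the log above); try the next one.
            continue
    # No shield available: skip origin shielding rather than return a 503.
    return fetch_from_origin(path)


if __name__ == "__main__":
    down = {"SJC"}  # assumed state for the demo

    def via_shield(pop: str, path: str) -> str:
        if pop in down:
            raise ShieldUnavailable(pop)
        return f"{pop}:{path}"

    def from_origin(path: str) -> str:
        return f"origin:{path}"

    print(fetch_with_failover("/npm", ["SJC", "LAX", "IAD"], via_shield, from_origin))
```

With a single-entry shield list and no origin fallback, the SJC failure would surface directly as an error to every edge POP shielding through it, which matches the worldwide 503s seldo reports.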