12:29 You have joined the channel | |
12:29 Mode: +nt | |
12:29 Created at: Aug 31, 2015, 6:50 PM | |
15:13 jgmize | |
pmac: tmux -S /tmp/shareds attach -t shared | |
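For context, the shared tmux session pmac is being pointed at runs on a common socket file rather than the default per-user one. A minimal sketch of how such a session is typically set up (the `new -s shared` step and the socket permissions are assumptions; only the attach command appears in the log):

    # on the host side: create the session on a shared socket (assumed setup)
    tmux -S /tmp/shareds new -s shared
    # loosen permissions so the other account can reach the socket (assumption)
    chmod 770 /tmp/shareds
    # on the joining side: attach to the same session, as in the message above
    tmux -S /tmp/shareds attach -t shared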
15:43 krutten | |
Hello | |
15:43 krutten | |
the host is out of file handles; we are looking into why | |
15:45 jgmize | |
I think it's probably the tiny nat host on that cluster | |
15:46 jgmize | |
it's a bottleneck for outbound traffic, so there are possibly too many file handles open for TCP sockets | |
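A few standard commands can confirm that kind of file-handle exhaustion on the NAT host; an illustrative sketch, not taken from the session itself:

    cat /proc/sys/fs/file-nr    # allocated vs. maximum file handles system-wide
    ulimit -n                   # per-process open-file limit in the current shell
    ss -s                       # socket counts, including TCP in various states
    sudo lsof -n | wc -l        # rough count of open descriptors across processes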
15:46 jgmize | |
I resized it on the other cluster but forgot to go back and do it on this one | |
15:47 jgmize | |
could be wrong of course and I'm open to other possibilities | |
15:47 krutten | |
can we adjust it given the lack of file handles? | |
15:53 krutten | |
jgmize: can you restart or resize the nat host? | |
15:55 jgmize | |
krutten: looking into that now | |
15:56 krutten | |
Fleet is just trying to talk to etcd, so we believe when the handles are freed up, it should recover | |
15:57 krutten | |
we should be able to see it clearly in the journal after the NAT restart | |
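Once handles free up, etcd's health can also be checked directly from a node; an illustrative sketch (the 4001 port matches the localhost:4001 lookups mentioned later, but these exact commands are an assumption):

    etcdctl cluster-health                   # etcd's own view of its peers
    curl -s http://127.0.0.1:4001/version    # quick local reachability check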
16:05 jgmize | |
krutten: as you and carmstrong can probably tell from the shared tmux session, I'm trying to figure out what I need to set to run the update-vpc.sh against this specific cluster-- do you happen to know? | |
16:06 jgmize | |
if not, and krancour isn't busy, maybe he could tell us since he wrote it? :) | |
16:06 krutten | |
checking | |
16:11 krutten | |
jgmize: whats the stack name for us-west? | |
16:11 jgmize | |
deis-vpc | |
16:12 krutten | |
you just want to apply this to the one cluster right? | |
16:12 jgmize | |
right | |
16:12 jgmize | |
I think I just need to set the region | |
16:12 jgmize | |
to us-west-2 | |
16:12 krutten | |
that makes sense | |
16:13 jgmize | |
the CloudFormation stack name is the same in both regions, I'm just trying to figure out where to specify it | |
16:13 jgmize | |
it being the region | |
16:15 krutten | |
jgmize: can you look at | |
16:15 krutten | |
~/.aws/config | |
16:15 jgmize | |
ok I think the AWS_DEFAULT_REGION env var should work | |
16:15 krutten | |
[ruby-2.1.5] Aries:.aws krutten$ cat config | |
16:15 krutten | |
[default] | |
16:15 jgmize | |
running it now | |
16:15 krutten | |
region = us-east-1 | |
16:15 krutten | |
[ruby-2.1.5] Aries:.aws krutten$ | |
16:15 krutten | |
ENV should work also | |
16:16 krutten | |
or `aws configure` | |
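Either approach points the AWS CLI, and anything reading its config, at the right region; a sketch of the two options discussed (the update-vpc.sh invocation itself is assumed, only the region handling is shown):

    # option 1: environment variable, for this run only
    export AWS_DEFAULT_REGION=us-west-2
    ./update-vpc.sh        # assumed invocation of the script mentioned above

    # option 2: persist it in ~/.aws/config (changes the default for later runs)
    aws configure set region us-west-2
    cat ~/.aws/config      # should now show region = us-west-2 under [default]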
16:19 jgmize | |
ok, the nat host is now an m4.large instead of a t2.micro | |
16:21 krutten | |
jgmize: can you run | |
16:21 krutten | |
journalctl -n 50 -u fleet --no-pager | |
16:24 jgmize | |
sure and you can drive now if you want | |
16:27 krutten | |
jgmize: localhost was missing from /etc/hosts | |
16:27 jgmize | |
yes, is that a known issue or something new? | |
16:27 krutten | |
so the traffic was hitting DNS | |
16:28 jgmize | |
is this something in the cloudformation template? | |
16:28 krutten | |
There are some pull requests krancour and chris are looking at. | |
16:28 jgmize | |
can you give me a link? | |
16:29 krutten | |
https://github.com/deis/deis/pull/4221 | |
16:29 jgmize | |
thanks | |
16:29 krutten | |
We are still looking at other things | |
16:30 krutten | |
jgmize: can we go on all the hosts and add localhost to the 127.0.0.1 line on /etc/hosts ? | |
16:30 jgmize | |
krutten: yes | |
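A minimal sketch of the fix being applied per host (the exact one-liner used in the shared session isn't captured in the log; this assumes a 127.0.0.1 entry already exists in /etc/hosts):

    # append "localhost" to the existing 127.0.0.1 line if it isn't already there
    grep -qw localhost /etc/hosts || \
      sudo sed -i 's/^127\.0\.0\.1.*/& localhost/' /etc/hosts
    getent hosts localhost    # should now resolve locally, without hitting DNS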
16:31 krutten | |
I'll let you drive as you know the instances better than I do | |
16:34 jgmize | |
ok, all the nodes in that cluster have that change applied. should we restart any services? | |
16:35 krutten | |
it should start to heal | |
16:35 krutten | |
I'd be curious to run fleetctl list-machines on each node | |
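For reference, that check is just the command below on each node; the column layout comes from fleet's documentation, and the values shown are placeholders rather than output from this cluster:

    fleetctl list-machines
    # MACHINE        IP           METADATA
    # 113f16a7...    10.21.1.57   -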
16:38 krutten | |
lookup localhost: too many open files | |
16:39 krutten | |
jgmize: can Chris drive for a minute? | |
16:40 jgmize | |
krutten: sure | |
16:40 krutten | |
looks like restarting fleet may be needed when it's in this state | |
16:40 krutten | |
update hosts and restart fleet. it's not rereading hosts :-/ | |
16:41 krutten | |
which I expected to happen at the libc level | |
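On these CoreOS hosts that restart is a systemd operation; a small sketch of the sequence implied here (the unit name `fleet` is standard on CoreOS, but this exact invocation is an assumption):

    sudo systemctl restart fleet
    journalctl -u fleet -f    # watch for the localhost:4001 lookups to start succeeding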
16:43 krutten | |
jgmize: I don't have the IP of the last machine | |
16:44 jgmize | |
omp | |
16:44 krutten | |
found it | |
16:45 krutten | |
matches etcd's list | |
16:46 krutten | |
so the lack of localhost in /etc/hosts put pressure on the NAT server until fleet started to fail the lookup of localhost:4001 | |
16:46 krutten | |
so we fixed both halves of the issue, NAT capacity and localhost lookup | |
16:47 jgmize | |
ok, I'll apply this fix to the other cluster as well, thanks for your help | |
16:48 krutten | |
Managing /etc/hosts should fall to the OS and provisioning service for multiple reasons, which is why we don't try to manage it (we removed that part) | |
16:48 jgmize | |
the other half-- it already has the larger nat cluster | |
16:48 krutten | |
but I've suggested a check on install to verify localhost is present and warn the user/installer | |
16:48 jgmize | |
sounds good | |
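The install-time check krutten suggests could be as small as the following; a sketch of the idea, not the actual proposal from the linked pull request:

    if ! grep -qE '^127\.0\.0\.1.*localhost' /etc/hosts; then
      echo "warning: 'localhost' missing from /etc/hosts; local lookups will fall back to DNS" >&2
    fi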
16:49 krutten | |
jgmize: after the NAT was resized, I'm not sure why that didn't "solve" it (cover it up) | |
16:52 krutten | |
jgmize: do things look healthy now? | |
16:58 jgmize | |
on this cluster, yes. I'm still applying the /etc/hosts changes to the other cluster, but since that one uses k8s, do you think I should go ahead and restart it instead of fleet? | |
17:00 krutten | |
I'd watch the logs. if it's not failing, leaving it up may be a good test | |
17:00 krutten | |
if it's production, then I would rolling restart it to be safe though | |
17:06 jgmize | |
the other one has been having issues too, but it's a much more experimental config and we wanted to focus on this one first. I think the /etc/hosts issue has been affecting both clusters though; we just didn't see the extent of the issues until we started doing stress testing