12:29 You have joined the channel | |
12:29 Mode: +nt | |
12:29 Created at: Aug 31, 2015, 6:50 PM | |
15:13 jgmize | |
pmac: tmux -S /tmp/shareds attach -t shared | |
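For context, the shared tmux session pmac is being pointed at runs on a common socket file rather than the default per-user one. A minimal sketch of how such a session is typically set up (the `new -s shared` step and the socket permissions are assumptions; only the attach command appears in the log):

    # on the host side: create the session on a shared socket (assumed setup)
    tmux -S /tmp/shareds new -s shared
    # loosen permissions so the other account can reach the socket (assumption)
    chmod 770 /tmp/shareds
    # on the joining side: attach to the same session, as in the message above
    tmux -S /tmp/shareds attach -t shared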
15:43 krutten | |
Hello | |
15:43 krutten | |
the host is out of file handles; we are looking into why | |
15:45 jgmize | |
I think it's probably the tiny nat host on that cluster | |
15:46 jgmize | |
it's a bottleneck for outbound traffic, so there are possibly too many file handles open for TCP sockets | |
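A few standard commands can confirm that kind of file-handle exhaustion on the NAT host; an illustrative sketch, not taken from the session itself:

    cat /proc/sys/fs/file-nr    # allocated vs. maximum file handles system-wide
    ulimit -n                   # per-process open-file limit in the current shell
    ss -s                       # socket counts, including TCP in various states
    sudo lsof -n | wc -l        # rough count of open descriptors across processes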
15:46 jgmize | |
I resized it on the other cluster but forgot to go back and do it on this one | |
15:47 jgmize | |
could be wrong of course and I'm open to other possibilities | |
15:47 krutten | |
can we adjust it given the lack of file handles? | |
15:53 krutten | |
jgmize: can you restart or resize the nat host? | |
15:55 jgmize | |
krutten: looking into that now | |
15:56 krutten | |
Fleet is just trying to talk to etcd, so we believe when the handles are freed up, it should recover | |
15:57 krutten | |
we should be able to see it clearly in the journal after the NAT restart | |
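Once handles free up, etcd's health can also be checked directly from a node; an illustrative sketch (the 4001 port matches the localhost:4001 lookups mentioned later, but these exact commands are an assumption):

    etcdctl cluster-health                   # etcd's own view of its peers
    curl -s http://127.0.0.1:4001/version    # quick local reachability check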
16:05 jgmize | |
krutten: as you and carmstrong can probably tell from the shared tmux session, I'm trying to figure out what I need to set to run the update-vpc.sh against this specific cluster-- do you happen to know? | |
16:06 jgmize | |
if not, and krancour isn't busy, maybe he could tell us since he wrote it? :) | |
16:06 krutten | |
checking | |
16:11 krutten | |
jgmize: whats the stack name for us-west? | |
16:11 jgmize | |
deis-vpc | |
16:12 krutten | |
you just want to apply this to the one cluster right? | |
16:12 jgmize | |
right | |
16:12 jgmize | |
I think I just need to set the region | |
16:12 jgmize | |
to us-west-2 | |
16:12 krutten | |
that makes sense | |
16:13 jgmize | |
the CloudFormation stack name is the same in both regions, I'm just trying to figure out where to specify it | |
16:13 jgmize | |
it being the region | |
16:15 krutten | |
jgmize: can you look at | |
16:15 krutten | |
~/.aws/config | |
16:15 jgmize | |
ok I think the AWS_DEFAULT_REGION env var should work | |
16:15 krutten | |
[ruby-2.1.5] Aries:.aws krutten$ cat config | |
16:15 krutten | |
[default] | |
16:15 jgmize | |
running it now | |
16:15 krutten | |
region = us-east-1 | |
16:15 krutten | |
[ruby-2.1.5] Aries:.aws krutten$ | |
16:15 krutten | |
ENV should work also | |
16:16 krutten | |
or `aws configure` | |
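Either approach points the AWS CLI, and anything reading its config, at the right region; a sketch of the two options discussed (the update-vpc.sh invocation itself is assumed, only the region handling is shown):

    # option 1: environment variable, for this run only
    export AWS_DEFAULT_REGION=us-west-2
    ./update-vpc.sh        # assumed invocation of the script mentioned above

    # option 2: persist it in ~/.aws/config (changes the default for later runs)
    aws configure set region us-west-2
    cat ~/.aws/config      # should now show region = us-west-2 under [default]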
16:19 jgmize | |
ok, the nat host is now an m4.large instead of a t2.micro | |
16:21 krutten | |
jgmize: can you run | |
16:21 krutten | |
journalctl -n 50 -u fleet --no-pager | |
16:24 jgmize | |
sure and you can drive now if you want | |
16:27 krutten | |
jgmize: localhost was missing from /etc/hosts | |
16:27 jgmize | |
yes, is that a known issue or something new? | |
16:27 krutten | |
so the traffic was hitting DNS | |
16:28 jgmize | |
is this something in the cloudformation template? | |
16:28 krutten | |
There are some pull requests krancour and chris are looking at. | |
16:28 jgmize | |
can you give me a link? | |
16:29 krutten | |
https://github.com/deis/deis/pull/4221 | |
16:29 jgmize | |
thanks | |
16:29 krutten | |
We are still looking at other things | |
16:30 krutten | |
jgmize: can we go on all the hosts and add localhost to the 127.0.0.1 line on /etc/hosts ? | |
16:30 jgmize | |
krutten: yes | |
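A minimal sketch of the fix being applied per host (the exact one-liner used in the shared session isn't captured in the log; this assumes a 127.0.0.1 entry already exists in /etc/hosts):

    # append "localhost" to the existing 127.0.0.1 line if it isn't already there
    grep -qw localhost /etc/hosts || \
      sudo sed -i 's/^127\.0\.0\.1.*/& localhost/' /etc/hosts
    getent hosts localhost    # should now resolve locally, without hitting DNS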
16:31 krutten | |
I'll let you drive as you know the instances better than I do | |
16:34 jgmize | |
ok, all the nodes in that cluster have that change applied. should we restart any services? | |
16:35 krutten | |
it should start to heal | |
16:35 krutten | |
I'd be curious to run fleetctl list-machines on each node | |
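For reference, that check is just the command below on each node; the column layout comes from fleet's documentation, and the values shown are placeholders rather than output from this cluster:

    fleetctl list-machines
    # MACHINE        IP           METADATA
    # 113f16a7...    10.21.1.57   -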
16:38 krutten | |
lookup localhost: too many open files | |
16:39 krutten | |
jgmize: can Chris drive for a minute? | |
16:40 jgmize | |
krutten: sure | |
16:40 krutten | |
looks like restarting fleet may be needed when it's in this state | |
16:40 krutten | |
update hosts and restart fleet. it's not rereading hosts :-/ | |
16:41 krutten | |
which I expected to happen at the libc level | |
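On these CoreOS hosts that restart is a systemd operation; a small sketch of the sequence implied here (the unit name `fleet` is standard on CoreOS, but this exact invocation is an assumption):

    sudo systemctl restart fleet
    journalctl -u fleet -f    # watch for the localhost:4001 lookups to start succeeding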
16:43 krutten | |
jgmize: I don't have the IP of the last machine | |
16:44 jgmize | |
omp | |
16:44 krutten | |
found it | |
16:45 krutten | |
matches etcd's list | |
16:46 krutten | |
so the lack of localhost in /etc/hosts put pressure on the NAT server until fleet started to fail the lookup of localhost:4001 | |
16:46 krutten | |
so we fixed both halves of the issue, NAT capacity and localhost lookup | |
16:47 jgmize | |
ok, I'll apply this fix to the other cluster as well, thanks for your help | |
16:48 krutten | |
Managing /etc/hosts should fall to the OS and provisioning service for multiple reasons, which is why we don't try to manage it (we removed that part) | |
16:48 jgmize | |
the other half-- it already has the larger nat cluster | |
16:48 krutten | |
but I've suggested a check on install to verify localhost is present and warn the user/installer | |
16:48 jgmize | |
sounds good | |
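The install-time check krutten suggests could be as small as the following; a sketch of the idea, not the actual proposal from the linked pull request:

    if ! grep -qE '^127\.0\.0\.1.*localhost' /etc/hosts; then
      echo "warning: 'localhost' missing from /etc/hosts; local lookups will fall back to DNS" >&2
    fi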
16:49 krutten | |
jgmize: after the NAT was resized, I'm not sure why that didn't "solve" it (cover it up) | |
16:52 krutten | |
jgmize: do things look healthy now? | |
16:58 jgmize | |
on this cluster, yes. I'm still applying the /etc/hosts changes to the other cluster, but since that one uses k8s, do you think I should go ahead and restart it instead of fleet? | |
17:00 krutten | |
I'd watch the logs. if it's not failing, leaving it up may be a good test | |
17:00 krutten | |
if it's production, then I would rolling restart it to be safe though | |
17:06 jgmize | |
the other one has been having issues too, but it's a much more experimental config and we wanted to focus on this one first. I think the /etc/hosts issue has been affecting both clusters though; we just didn't see the extent of the issues until we started doing stress testing