conversation between rob_ and josnyder in #sensu
2015-01-19 12:02:46 josnyder hi rob_, t0m
2015-01-19 12:03:19 josnyder i indeed overlooked that we should be using a sensu agent socket to send messages
2015-01-19 12:05:46 josnyder rob_: I'm specifically thinking of a use case where a job gets scheduled over a chronos cluster and sends results to sensu
2015-01-19 12:06:39 josnyder so any of 100s of machines could run the job, and then want to report the same client/check pair to the sensu aggregator
2015-01-19 12:06:52 rob_ josnyder: hey :)
2015-01-19 12:07:29 rob_ josnyder: sounds like you could just send it to a sensu-agent somewhere then, right?
2015-01-19 12:07:38 josnyder yeah, each of the nodes will have a sensu agent
2015-01-19 12:07:40 rob_ could even have a few sensu agents running and load balance them :)
2015-01-19 12:07:44 josnyder yep
2015-01-19 12:07:58 josnyder but there's no guarantee that the job will be scheduled or run correctly
2015-01-19 12:08:11 josnyder so I'd like to be notified if a check result fails to be sent
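
For reference, the agent socket being discussed is the sensu-client's local input socket, which accepts a JSON check result on TCP port 3030 by default. A minimal sketch of a job reporting through it; the check name and output are made up:

    require 'socket'
    require 'json'

    # Hypothetical check result for a chronos batch job; any node in the
    # cluster could report this same client/check pair.
    result = {
      :name   => 'chronos_batch_job',
      :output => 'job completed in 42s',
      :status => 0  # 0 = OK, 1 = WARNING, 2 = CRITICAL, per the Nagios convention
    }

    # The sensu-client listens for JSON results on localhost:3030 by default.
    TCPSocket.open('localhost', 3030) do |sock|
      sock.write(result.to_json)
    end
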
2015-01-19 12:09:12 rob_ josnyder: ah, and that's why you wanted to introduce per-check keepalives?
2015-01-19 12:09:18 josnyder yep
2015-01-19 12:09:37 rob_ josnyder: did you see how sensu server processes *any* check data as a keepalive, as well as the default keepalive subscription?
2015-01-19 12:09:56 josnyder no...this is interesting to me
2015-01-19 12:10:35 rob_ josnyder: im guessing the actual keepalive subscription is there for when a server has no checks defined
2015-01-19 12:11:35 rob_ josnyder: determine_stale_clients actually loops over all client instances when it checks for machines with failed keepalives, see: https://github.com/sensu/sensu/blob/master/lib/sensu/server.rb#L573
2015-01-19 12:12:22 josnyder rob_: yeah, but there can only be one check on each client with this functionality, right?
2015-01-19 12:13:56 rob_ josnyder: oh wait, i think i misunderstood something..
2015-01-19 12:14:23 rob_ josnyder: the client data is sent with every keepalive, so a keepalive isnt a check, it's an update of the entire client data in redis...
2015-01-19 12:14:38 rob_ (i think)
2015-01-19 12:14:41 * josnyder checks
2015-01-19 12:15:56 josnyder rob_: the only place I see the client:hostname key in redis being set is https://github.com/sensu/sensu/blob/master/lib/sensu/server.rb#L34
2015-01-19 12:18:21 rob_ josnyder: yeah, so that subscription sets up the (re-)writing of the client data to redis on every keepalive
2015-01-19 12:18:41 josnyder ah, yep
2015-01-19 12:18:47 josnyder that block gets called on every keepalive
2015-01-19 12:19:02 rob_ then the timestamp is read from the client data here: https://github.com/sensu/sensu/blob/master/lib/sensu/server.rb#L585
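
The determine_stale_clients logic linked above boils down to comparing each client's last keepalive timestamp against warning/critical thresholds. A rough, self-contained paraphrase; the thresholds and sample data are illustrative, not necessarily Sensu's actual defaults:

    # Illustrative stand-ins for client data as stored in redis; in sensu
    # the :timestamp field is refreshed by every keepalive.
    clients = [
      { :name => 'web01', :timestamp => Time.now.to_i - 30 },
      { :name => 'web02', :timestamp => Time.now.to_i - 200 }
    ]

    clients.each do |client|
      age = Time.now.to_i - client[:timestamp]
      if age >= 180
        puts "#{client[:name]} keepalive CRITICAL: no keepalive for #{age}s"
      elsif age >= 120
        puts "#{client[:name]} keepalive WARNING: no keepalive for #{age}s"
      end
    end
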
2015-01-19 12:19:40 rob_ ok, so lets look at your usecase again
2015-01-19 12:20:33 rob_ i would consider using some kind of proxy between a sensu-agent and your event producer that sends a warn/crit event to sensu if a check isnt received when one is expected
2015-01-19 12:21:01 rob_ that way you can use the (very simple) masquerading that we've already come up with without adding extra complexity to sensu but still get the functionality you want
2015-01-19 12:21:04 josnyder yep, that's one possible implementation
2015-01-19 12:21:38 josnyder I think it's fair to go down that road
2015-01-19 12:23:17 josnyder the thing I find interesting is how much the implementation of such a proxy would match the implementation of determine_stale_clients
2015-01-19 12:23:25 josnyder it would be basically identical
2015-01-19 12:24:06 josnyder that's the only good reason I have for implementing that logic within sensu
2015-01-19 12:24:28 josnyder rob_: you've taught me a bit; thanks
2015-01-19 12:24:56 rob_ josnyder: you could create a sensu-client that only sends a keepalive when it receives a piece of event-data
2015-01-19 12:25:22 josnyder rob_: why don't we spin this discussion out into a new github issue
2015-01-19 12:25:26 rob_ instead of being an agent it'd be more like a proxy since it needn't schedule anything
2015-01-19 12:25:44 josnyder yep, but then you need a new source for every check
2015-01-19 12:26:04 josnyder say I have 400 batch jobs running on my chronos cluster
2015-01-19 12:26:26 josnyder monitoring each one would be 400 proxies (or one proxy that acts like 400 clients)
2015-01-19 12:27:06 rob_ i was picturing each job configured with the same source sending event data through a single proxy
2015-01-19 12:27:21 josnyder yeah, but there's no guarantee that the jobs fail together at once
2015-01-19 12:27:26 rob_ maybe the mental image i have of what you're talking about is wrong :)
2015-01-19 12:27:49 rob_ ah, so you care about each individual batch job?
2015-01-19 12:27:55 josnyder yep
2015-01-19 12:28:19 josnyder if someone makes a code change that causes a given batch job to fail, I want a single check to go critical on the other end
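
To make the proxy idea concrete: a minimal sketch of a watchdog that tracks when each batch job last reported and pushes a critical result for that job's own check through the local client socket once a deadline passes. The job names, deadlines, and report-in path are all assumptions:

    require 'socket'
    require 'json'

    # Hypothetical per-job deadlines (seconds between expected results).
    DEADLINES = { 'nightly_etl' => 86_400, 'hourly_rollup' => 3_600 }

    # In a real proxy this hash would be updated by whatever endpoint the
    # jobs report their completions to; that path is elided here, so the
    # default value is simply the proxy's start time.
    last_seen = Hash.new(Time.now.to_i)

    def send_result(name, status, output)
      # Forward a JSON check result to the local sensu-client socket.
      TCPSocket.open('localhost', 3030) do |sock|
        sock.write({ :name => name, :status => status, :output => output }.to_json)
      end
    end

    loop do
      DEADLINES.each do |job, deadline|
        age = Time.now.to_i - last_seen[job]
        send_result(job, 2, "no result from #{job} in #{age}s") if age > deadline
      end
      sleep 60
    end
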
2015-01-19 12:29:39 josnyder rob_: at this point I think you should be doing what t0m suggested
2015-01-19 12:29:45 josnyder which is to say 'patches welcome'
2015-01-19 12:30:00 josnyder unless, of course, you have a lingering architectural objection
2015-01-19 12:30:03 rob_ ok, i think i get it.. i think the first test i'd do is to try sending event data through the sensu socket and then seeing what happens if you only send one event..
2015-01-19 12:30:40 josnyder we actually have a script that solarkennedy wrote
2015-01-19 12:30:42 rob_ josnyder: i have no objections, i think we're going to be deploying mesos where i work this quarter so this is of interest to me :)
2015-01-19 12:30:44 josnyder that does exactly that
2015-01-19 12:31:05 rob_ josnyder: what happens..?
2015-01-19 12:31:25 josnyder the check shows up in sensu, and...lingers
2015-01-19 12:31:26 josnyder iirc
2015-01-19 12:31:36 josnyder let me check
2015-01-19 12:31:43 rob_ ah, so its state never changes?
2015-01-19 12:32:07 rob_ does the client do anything clever with event data it receives through the socket?
2015-01-19 12:32:18 * rob_ goes to have a look
2015-01-19 12:33:29 rob_ so i think the socket is just a tunnel to rabbitmq..?
2015-01-19 12:35:11 rob_ josnyder: are you talking about his sensu-shell-helper stuff? i've been meaning to take a look at it :)
2015-01-19 12:35:38 josnyder I may be...
2015-01-19 12:35:42 josnyder let me see if it's open source already
2015-01-19 12:35:51 josnyder if not...I can post it as a gist
2015-01-19 12:38:26 josnyder rob_: https://gist.github.com/hashbrowncipher/c044bc60a2c8d577dc50
2015-01-19 12:38:45 josnyder it's really convenient
2015-01-19 12:38:50 rob_ oh cool!
2015-01-19 12:39:00 rob_ https://github.com/sensu/sensu/blob/master/lib/sensu/client.rb#L230
2015-01-19 12:39:17 rob_ this looks like where the client opens a socket and binds it to the rabbitmq transport
2015-01-19 12:39:37 rob_ i cant find anything in the code that looks like it's modifying the event data
2015-01-19 12:39:51 rob_ unless it's in sensu-em
2015-01-19 12:41:01 rob_ ah that's just a fork of eventmachine
2015-01-19 12:41:15 josnyder rob_: so yeah, one of my hosts now has a resolved test_alert_for_josnyder
2015-01-19 12:41:21 josnyder I don't think it will ever go away
2015-01-19 12:41:40 josnyder well...until redis loses track of the key somehow
2015-01-19 12:42:07 rob_ josnyder: ok :) now your idea to add a timeout to each event data makes more sense to me
2015-01-19 12:42:22 rob_ er, timestamp
2015-01-19 12:45:58 * josnyder sends some probing queries to redis
2015-01-19 12:46:31 rob_ i wonder how nagios would handle this
2015-01-19 12:48:37 josnyder i mean, right now we're using nagios with passive checks and passive check timeouts
2015-01-19 12:48:52 rob_ ah, right, of course
2015-01-19 12:48:53 josnyder the feature is pretty directly analogous to how nagios does it
2015-01-19 12:51:20 rob_ hmm
2015-01-19 12:53:52 rob_ ok, so basically there would be a 'determine_stale_checks' loop that would publish a critical event data for a check if whatever threshold is breached
2015-01-19 12:54:00 rob_ maybe it wouldnt be that much effort to implement
2015-01-19 12:54:39 josnyder I hope so
2015-01-19 12:55:07 josnyder rob_: btw. one of our development servers now has a 'test_alert_for_josnyder' displayed on it in the dashboard
2015-01-19 12:55:12 josnyder I don't think it will ever go away
2015-01-19 12:55:16 rob_ ok, at that point i think the original masquerading stuff doesn't change..
2015-01-19 12:55:55 rob_ josnyder: redis-cli rem '*test_alert_for_josnyder*' will probably do the job
2015-01-19 12:56:15 josnyder yep, that would work
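
One nit on that one-liner: redis has no `rem` command, so clearing those keys would actually look more like a pattern scan plus DEL. A sketch using the redis gem; the key pattern follows the execution:/history: naming that comes up later in the conversation:

    require 'redis'

    redis = Redis.new  # localhost:6379 by default

    # SCAN through keys matching the stale check's name and delete each one.
    # (KEYS would also work, but SCAN avoids blocking a busy server.)
    redis.scan_each(:match => '*test_alert_for_josnyder*') do |key|
      redis.del(key)
    end
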
2015-01-19 12:56:38 rob_ josnyder: would be interesting to leave it for 24hr just to see what happens though
2015-01-19 12:56:53 josnyder rob_: looking at our uchiwa, we have plenty of checks that are over 24 hours old
2015-01-19 12:57:38 josnyder I'm seeing green checks from 2015-01-08 and 2015-01-13
2015-01-19 12:57:54 josnyder I think it would go back further if we hadn't cleared redis during our recent sensu upgrade
2015-01-19 12:59:15 rob_ josnyder: ok, so this seems like a legit bug then
2015-01-19 12:59:26 rob_ given the socket functionality exists
2015-01-19 13:01:01 josnyder rob_: what do you think is a bug, specifically?
2015-01-19 13:01:15 josnyder I think the behavior we're looking at is actually a feature, of sorts
2015-01-19 13:02:16 rob_ josnyder: i think it's a problem that if event data for a check isnt updated then it's almost certainly going to be stale data after a period of time
2015-01-19 13:03:10 josnyder yeah
2015-01-19 13:03:47 josnyder I think what's going on is that the sensu server has no clue whether an event is stale, or whether it just runs very infrequently
2015-01-19 13:03:50 rob_ i think it's usually 'ok' since scheduled checks are always going to get their data updated unless the server is down at which point the keepalive event will show the problem
2015-01-19 13:04:00 josnyder yep
2015-01-19 13:04:03 rob_ that's true
2015-01-19 13:04:21 josnyder so if we gave the server some idea of "the check should be updated at least this often"
2015-01-19 13:04:37 rob_ josnyder: i think that already exists, right?
2015-01-19 13:04:42 josnyder it does?
2015-01-19 13:04:58 rob_ well a check definition has interval
2015-01-19 13:05:35 rob_ so if we add a :timestamp to event data and then the timestamp is less than: Time.now - interval
2015-01-19 13:05:54 josnyder timestamp is already there
2015-01-19 13:06:01 josnyder it gets stored as execution:client:check
2015-01-19 13:06:03 josnyder in redis
2015-01-19 13:06:16 rob_ oh, sweet
2015-01-19 13:06:21 josnyder but right now there's no server side logic that makes use of the 'interval' key in check data
2015-01-19 13:06:37 rob_ so we can just add something to process_result
2015-01-19 13:07:23 rob_ ah, the timestamp is 'check[:executed]'
2015-01-19 13:07:46 rob_ oh, hmm, that wouldnt work
2015-01-19 13:10:42 rob_ i think we'd have to write the event data - or at least just the interval, to redis
2015-01-19 13:11:00 josnyder yeah
2015-01-19 13:11:10 rob_ then have a determine_stale_checks method
2015-01-19 13:11:28 josnyder well, there would be two possible behaviors when a check is determined to be stale
2015-01-19 13:11:42 josnyder 1) assume this is the correct operation, and that (for instance) Puppet removed the check because it's no longer desired
2015-01-19 13:12:03 josnyder 2) assume this is errant operation, and send a CRITICAL event for that check
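
A minimal sketch of what the determine_stale_checks loop for option 2 could look like, built on the execution:client:check keys described above. As rob_ notes, the per-check interval would have to come from check data written to redis; here it's a hard-coded assumption, and the 2x grace multiplier is arbitrary:

    require 'redis'

    redis = Redis.new
    interval = 60  # assumed; would be read from the stored check definition

    redis.scan_each(:match => 'execution:*') do |key|
      _, client, check = key.split(':', 3)
      executed = redis.get(key).to_i
      age = Time.now.to_i - executed
      # Allow one missed run before alerting; the multiplier is arbitrary.
      if age > interval * 2
        puts "#{client}/#{check} CRITICAL: no result for #{age}s"
      end
    end
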
2015-01-19 13:12:16 rob_ for 1) you can call the sensu-api to remove the check from redis..
2015-01-19 13:12:32 rob_ like what uchiwa does if you 'X' a check or host
2015-01-19 13:12:42 rob_ i think..
2015-01-19 13:12:59 josnyder I didn't know you could 'X' an individual check
2015-01-19 13:13:01 * josnyder looks
2015-01-19 13:13:16 rob_ i think it's called 'resolve'
2015-01-19 13:13:43 josnyder resolving a check doesn't remove it
2015-01-19 13:13:53 rob_ what happens?
2015-01-19 13:13:58 * josnyder checks that assertion
2015-01-19 13:14:10 rob_ i thought it just clears the data from redis for that check
2015-01-19 13:16:15 josnyder rob_: I ran send-test-sensu-alert
2015-01-19 13:16:25 josnyder and then did 'sensu-cli resolve dev4-devb.dev.yelpcorp.com "test_alert_for_josnyder"'
2015-01-19 13:16:39 josnyder the check still shows up in uchiwa
2015-01-19 13:16:48 rob_ josnyder: uchiwa can take a second to update..
2015-01-19 13:17:01 rob_ what does sensu-cli list events say?
2015-01-19 13:17:23 rob_ also redis-cli keys '*test_alert_for_josnyder*'
2015-01-19 13:18:08 josnyder yeah, two keys
2015-01-19 13:18:20 josnyder 1) "execution:dev4-devb.dev.yelpcorp.com:test_alert_for_josnyder"
2015-01-19 13:18:20 josnyder 2) "history:dev4-devb.dev.yelpcorp.com:test_alert_for_josnyder"
2015-01-19 13:18:57 rob_ what values do they contain?
2015-01-19 13:19:01 rob_ maybe it just wipes out the history
2015-01-19 13:20:42 josnyder https://gist.github.com/hashbrowncipher/ccc2a8257c80899282ce
2015-01-19 13:21:23 rob_ ah so it writes a 0 to history, sneaky
2015-01-19 13:21:41 rob_ and possibly updates the execution time :)
2015-01-19 13:24:15 josnyder yeah, it's just this
2015-01-19 13:24:15 josnyder https://github.com/Yelp/sensu/blob/fake_server_source/lib/sensu/api.rb#L207
2015-01-19 13:24:33 josnyder all it does is create an event with status=>0
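
So 'resolve' feeds a synthetic OK event back through the pipeline rather than deleting anything. For completeness, a sketch of the equivalent call against the Sensu API (default port 4567), which is roughly what sensu-cli and uchiwa go through:

    require 'net/http'
    require 'json'

    # POST /resolve publishes a status 0 result for the client/check pair;
    # it does not delete the execution:/history: keys directly.
    request = Net::HTTP::Post.new('/resolve', 'Content-Type' => 'application/json')
    request.body = {
      :client => 'dev4-devb.dev.yelpcorp.com',
      :check  => 'test_alert_for_josnyder'
    }.to_json

    response = Net::HTTP.new('localhost', 4567).request(request)
    puts response.code  # the API accepts the resolve asynchronously
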
2015-01-19 13:27:22 rob_ josnyder: so what's next..? could post something to the mailing list and try to get portertech's input and see what he'd accept into mainline..?
2015-01-19 13:28:15 josnyder that seems like a good idea
2015-01-19 13:28:33 josnyder I'll also want to talk to some people at Yelp
2015-01-19 13:29:50 rob_ but you cant do what you want using the sensu-client the way you describe, no
2015-01-19 13:31:23 rob_ josnyder: interestingly, when you stop a sensu agent and the keepalive goes critical, all the rest of the checks for that host stay OK
2015-01-19 13:31:29 josnyder yep
2015-01-19 13:31:37 josnyder they're frozen in time
2015-01-19 13:32:40 rob_ josnyder: i suppose the expected pattern is to actually care about keepalive alerts
2015-01-19 13:32:54 rob_ i would prefer to create explicit dependencies between a check and keepalive, i think
2015-01-19 13:33:05 josnyder I tell people that keepalive alerts mean "your checks on this client aren't being run"
2015-01-19 13:33:25 josnyder and to forget any nonsense about the host being down or whatnot
2015-01-19 13:33:33 rob_ yeah
2015-01-19 13:34:09 rob_ we have a more service oriented architecture where knowing a machine is down doesnt really tell you what services are down
2015-01-19 13:34:32 rob_ (for better or for worse)
2015-01-19 13:34:54 josnyder rob_: many of our services aren't monitored on individual hosts
2015-01-19 13:35:09 josnyder instead, we query the load balancer to ask whether a certain proportion of backends are up
2015-01-19 13:35:10 rob_ do you do aggregate checks or something?
2015-01-19 13:35:15 rob_ ah
2015-01-19 13:35:31 josnyder that's actually a very good example of where I'd like a check timeout
2015-01-19 13:35:44 josnyder because if the check runs and never returns a result, that's a problem
2015-01-19 13:35:53 rob_ yeah
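
For illustration, the kind of load-balancer-level check josnyder describes, written as a standalone Sensu/Nagios-style check script: scrape haproxy's stats CSV and alert when too small a proportion of a service's backends are up. The stats URL, backend name, and 50% threshold are all assumptions:

    require 'net/http'

    # Assumed haproxy stats endpoint and backend pool name.
    csv = Net::HTTP.get(URI('http://localhost:8080/haproxy;csv'))
    rows = csv.lines.map { |line| line.split(',') }

    # Keep only the per-server rows for our service; column 0 is the proxy
    # name and column 1 is the server name (or BACKEND/FRONTEND summaries).
    servers = rows.select do |r|
      r[0] == 'my_service' && !%w[BACKEND FRONTEND].include?(r[1])
    end

    up = servers.count { |r| r[17].to_s.start_with?('UP') }  # column 17 is 'status'
    ratio = servers.empty? ? 0.0 : up.to_f / servers.size

    if ratio < 0.5
      puts "CRITICAL: only #{up}/#{servers.size} backends up"
      exit 2
    end
    puts "OK: #{up}/#{servers.size} backends up"
    exit 0
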