@hashbrowncipher
Created January 20, 2015 18:03
conversation between rob_ and josnyder in #sensu
2015-01-19 12:02:46 josnyder hi rob_, t0m
2015-01-19 12:03:19 josnyder i indeed overlooked that we should be using a sensu agent socket to send messages
2015-01-19 12:05:46 josnyder rob_: I'm specifically thinking of a use case where a job gets scheduled over a chronos cluster and sends results to sensu
2015-01-19 12:06:39 josnyder so any of 100s of machines could run the job, and then want to report the same client/check pair to the sensu aggregator
2015-01-19 12:06:52 rob_ josnyder: hey :)
2015-01-19 12:07:29 rob_ josnyder: sounds like you could just send it to a sensu-agent somewhere then, right?
2015-01-19 12:07:38 josnyder yeah, each of the nodes will have a sensu agent
2015-01-19 12:07:40 rob_ could even have a few sensu agents running and load balance them :)
2015-01-19 12:07:44 josnyder yep
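(For context: a sensu-client listens on a local TCP socket, 127.0.0.1:3030 by default, and accepts external check results as JSON, forwarding them to the transport under the local client's name. A minimal sketch of what a job wrapper might send follows; the check name and output are invented for illustration.)

    require "json"
    require "socket"

    # hypothetical result for one batch job; 0 = OK, 1 = WARNING, 2 = CRITICAL
    result = {
      :name   => "chronos_batch_job",
      :output => "job finished in 42s",
      :status => 0
    }

    # write the JSON to the local sensu-client socket (default 127.0.0.1:3030)
    sock = TCPSocket.new("127.0.0.1", 3030)
    sock.write(result.to_json)
    sock.close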
2015-01-19 12:07:58 josnyder but there's no guarantee that the job will be scheduled or run correctly
2015-01-19 12:08:11 josnyder so I'd like to be notified if a check result fails to be sent
2015-01-19 12:09:12 rob_ josnyder: ah, and that's why you wanted to introduce per-check keepalives?
2015-01-19 12:09:18 josnyder yep
2015-01-19 12:09:37 rob_ josnyder: did you see how sensu server processes *any* check data as a keepalive, as well as the default keepalive subscription?
2015-01-19 12:09:56 josnyder no...this is interesting to me
2015-01-19 12:10:35 rob_ josnyder: im guessing the actual keepalive subscription is there for when a server has no checks defined
2015-01-19 12:11:35 rob_ josnyder: determine_stale_clients actually loops over all client instances when it checks for machines with failed keepalives, see: https://github.com/sensu/sensu/blob/master/lib/sensu/server.rb#L573
2015-01-19 12:12:22 josnyder rob_: yeah, but there can only be one check on each client with this functionality, right?
2015-01-19 12:13:56 rob_ josnyder: oh wait, i think i misunderstood something..
2015-01-19 12:14:23 rob_ josnyder: the client data is sent with every keepalive, so a keepalive isnt a check, it's an update of the entire client data in redis...
2015-01-19 12:14:38 rob_ (i think)
2015-01-19 12:14:41 * josnyder checks
2015-01-19 12:15:56 josnyder rob_: the only place I see the client:hostname key in redis being set is https://github.com/sensu/sensu/blob/master/lib/sensu/server.rb#L34
2015-01-19 12:18:21 rob_ josnyder: yeah, so that subscription sets up the (re-)writing of the client data to redis on every keepalive
2015-01-19 12:18:41 josnyder ah, yep
2015-01-19 12:18:47 josnyder that block gets called on every keepalive
2015-01-19 12:19:02 rob_ then the timestamp is read from the client data here: https://github.com/sensu/sensu/blob/master/lib/sensu/server.rb#L585
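(Roughly, the keepalive staleness logic being walked through here works like the sketch below: each keepalive rewrites the client hash in redis, including a :timestamp, and a periodic loop compares that timestamp against warning/critical thresholds, 120s and 180s by default. The clients collection and publish_keepalive_result helper are stand-ins for the server's own plumbing, not Sensu's actual code.)

    # simplified sketch of determine_stale_clients-style logic
    clients.each do |client|
      age = Time.now.to_i - client[:timestamp]
      check = { :name => "keepalive", :issued => Time.now.to_i }
      if age >= 180
        check[:status] = 2
        check[:output] = "No keepalive sent from client in over 180 seconds"
      elsif age >= 120
        check[:status] = 1
        check[:output] = "No keepalive sent from client in over 120 seconds"
      else
        check[:status] = 0
        check[:output] = "Keepalive sent from client"
      end
      publish_keepalive_result(client, check)
    end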
2015-01-19 12:19:40 rob_ ok, so lets look at your usecase again
2015-01-19 12:20:33 rob_ i would consider using some kind of proxy between a sensu-agent and your event producer that sends a warn/crit event to sensu if a check isnt received when one is expected
2015-01-19 12:21:01 rob_ that way you can use the (very simple) masquerading that we've already come up with without adding extra complexity to sensu but still get the functionality you want
2015-01-19 12:21:04 josnyder yep, that's one possible implementation
2015-01-19 12:21:38 josnyder I think it's fair to go down that road
2015-01-19 12:23:17 josnyder the thing I find interesting is how much the implementation of such a proxy would match the implementation of determine_stale_clients
2015-01-19 12:23:25 josnyder it would be basically identical
2015-01-19 12:24:06 josnyder that's the only good reason I have for implementing that logic within sensu
2015-01-19 12:24:28 josnyder rob_: you've taught me a bit; thanks
2015-01-19 12:24:56 rob_ josnyder: you could create a sensu-client that only sends a keepalive when it receives a piece of event-data
2015-01-19 12:25:22 josnyder rob_: why don't we spin this discussion out into a new github issue
2015-01-19 12:25:26 rob_ instead of being an agent it'd be more like proxy since it needn't schedule anything
2015-01-19 12:25:44 josnyder yep, but then you need a new source for every check
2015-01-19 12:26:04 josnyder say I have 400 batch jobs running on my chronos cluster
2015-01-19 12:26:26 josnyder monitoring each one would be 400 proxies (or one proxy that acts like 400 clients)
2015-01-19 12:27:06 rob_ i was picturing each job configured with the same source sending event data through a single proxy
2015-01-19 12:27:21 josnyder yeah, but there's no guarantee that the jobs fail together at once
2015-01-19 12:27:26 rob_ maybe the mental image i have of what you're talking about is wrong :)
2015-01-19 12:27:49 rob_ ah, so you care about each individual batch job?
2015-01-19 12:27:55 josnyder yep
2015-01-19 12:28:19 josnyder if someone makes a code change that causes a given batch job to fail, I want a single check to go critical on the other end
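(One way to picture the proxy idea above: a small watchdog that remembers when each expected batch job last reported and pushes a critical result into the local sensu-client socket for any job that has gone quiet. Everything below is a sketch; the job list, deadlines, and last_seen bookkeeping are invented for illustration.)

    require "json"
    require "socket"

    # expected check name => maximum allowed silence in seconds (illustrative)
    expected  = { "nightly_reindex" => 26 * 3600, "hourly_rollup" => 2 * 3600 }
    last_seen = Hash.new(Time.now.to_i)   # updated whenever a job reports in

    expected.each do |check_name, deadline|
      next unless Time.now.to_i - last_seen[check_name] > deadline
      critical = {
        :name   => check_name,
        :output => "no result received in #{deadline} seconds",
        :status => 2
      }
      sock = TCPSocket.new("127.0.0.1", 3030)
      sock.write(critical.to_json)
      sock.close
    end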
2015-01-19 12:29:39 josnyder rob_: at this point I think you should be doing what t0m suggested
2015-01-19 12:29:45 josnyder which is to say 'patches welcome'
2015-01-19 12:30:00 josnyder unless, of course, you have a lingering architectural objection
2015-01-19 12:30:03 rob_ ok, i think i get it.. i think the first test i'd do is to try sending event data through the sensu socket and then seeing what happens if you only send one event..
2015-01-19 12:30:40 josnyder we actually have a script that solarkennedy wrote
2015-01-19 12:30:42 rob_ josnyder: i have no objections, i think we're going to be deploying mesos where i work this quarter so this is of interest to me :)
2015-01-19 12:30:44 josnyder that does exactly that
2015-01-19 12:31:05 rob_ josnyder: what happens..?
2015-01-19 12:31:25 josnyder the check shows up in sensu, and...lingers
2015-01-19 12:31:26 josnyder iirc
2015-01-19 12:31:36 josnyder let me check
2015-01-19 12:31:43 rob_ ah, so its state never changes?
2015-01-19 12:32:07 rob_ does the client do anything clever with event data it receives through the socket?
2015-01-19 12:32:18 * rob_ goes to have a look
2015-01-19 12:33:29 rob_ so i think the socket is just a tunnel to rabbitmq..?
2015-01-19 12:35:11 rob_ josnyder: are you talking about his sensu-shell-helper stuff? i've been meaning to take a look at it :)
2015-01-19 12:35:38 josnyder I may be...
2015-01-19 12:35:42 josnyder let me see if it's open source already
2015-01-19 12:35:51 josnyder if not...I can post it as a gist
2015-01-19 12:38:26 josnyder rob_: https://gist.github.com/hashbrowncipher/c044bc60a2c8d577dc50
2015-01-19 12:38:45 josnyder it's really convenient
2015-01-19 12:38:50 rob_ oh cool!
2015-01-19 12:39:00 rob_ https://github.com/sensu/sensu/blob/master/lib/sensu/client.rb#L230
2015-01-19 12:39:17 rob_ this looks like where the client opens a socket and binds it to the rabbitmq transport
2015-01-19 12:39:37 rob_ i cant find anything in the code that looks like it's modifying the event data
2015-01-19 12:39:51 rob_ unless it's in sensu-em
2015-01-19 12:41:01 rob_ ah that's just a fork of eventmachine
2015-01-19 12:41:15 josnyder rob_: so yeah, one of my hosts now has a resolved test_alert_for_josnyder
2015-01-19 12:41:21 josnyder I don't think it will ever go away
2015-01-19 12:41:40 josnyder well...until redis loses track of the key somehow
2015-01-19 12:42:07 rob_ josnyder: ok :) now your idea to add a timeout to each event data makes more sense to me
2015-01-19 12:42:22 rob_ er, timestamp
2015-01-19 12:45:58 * josnyder sends some probing queries to redis
2015-01-19 12:46:31 rob_ i wonder how nagios would handle this
2015-01-19 12:48:37 josnyder i mean, right now we're using nagios with passive checks and passive check timeouts
2015-01-19 12:48:52 rob_ ah, right, of course
2015-01-19 12:48:53 josnyder the feature is pretty directly analogous to how nagios does it
2015-01-19 12:51:20 rob_ hmm
2015-01-19 12:53:52 rob_ ok, so basically there would be a 'determine_stale_checks' loop that would publish a critical event data for a check if whatever threshold is breached
2015-01-19 12:54:00 rob_ maybe it wouldnt be that much effort to implement
2015-01-19 12:54:39 josnyder I hope so
2015-01-19 12:55:07 josnyder rob_: btw. one of our development servers now has a 'test_alert_for_josnyder' displayed on it in the dashboard
2015-01-19 12:55:12 josnyder I don't think it will ever go away
2015-01-19 12:55:16 rob_ ok, at that point i think the original masquerading stuff doesn't change..
2015-01-19 12:55:55 rob_ josnyder: redis-cli keys '*test_alert_for_josnyder*' | xargs redis-cli del will probably do the job
2015-01-19 12:56:15 josnyder yep, that would work
2015-01-19 12:56:38 rob_ josnyder: would be interesting to leave it for 24hr just to see what happens though
2015-01-19 12:56:53 josnyder rob_: looking at our uchiwa, we have plenty of checks that are over 24 hours old
2015-01-19 12:57:38 josnyder I'm seeing green checks from 2015-01-08 and 2015-01-13
2015-01-19 12:57:54 josnyder I think it would go back further if we hadn't cleared redis during our recent sensu upgrade
2015-01-19 12:59:15 rob_ josnyder: ok, so this seems like a legit bug then
2015-01-19 12:59:26 rob_ given the socket functionality exists
2015-01-19 13:01:01 josnyder rob_: what do you think is a bug, specifically?
2015-01-19 13:01:15 josnyder I think the behavior we're looking at is actually a feature, of sorts
2015-01-19 13:02:16 rob_ josnyder: i think it's a problem that if event data for a check isnt updated then it's almost certainly going to be stale data after a period of time
2015-01-19 13:03:10 josnyder yeah
2015-01-19 13:03:47 josnyder I think what's going on is that the sensu server has no clue whether an event is stale, or whether it just runs very infrequently
2015-01-19 13:03:50 rob_ i think it's usually 'ok' since scheduled checks are always going to get their data updated unless the server is down at which point the keepalive event will show the problem
2015-01-19 13:04:00 josnyder yep
2015-01-19 13:04:03 rob_ that's true
2015-01-19 13:04:21 josnyder so if we gave the server some idea of "the check should be updated at least this often"
2015-01-19 13:04:37 rob_ josnyder: i think that already exists, right?
2015-01-19 13:04:42 josnyder it does?
2015-01-19 13:04:58 rob_ well a check definition has interval
2015-01-19 13:05:35 rob_ so if we add a :timestamp to event data and then the timestamp is less than: Time.now - interval
2015-01-19 13:05:54 josnyder timestamp is already there
2015-01-19 13:06:01 josnyder it gets stored as execution:client:check
2015-01-19 13:06:03 josnyder in redis
2015-01-19 13:06:16 rob_ oh, sweet
2015-01-19 13:06:21 josnyder but right now there's no server side logic that makes use of the 'interval' key in check data
2015-01-19 13:06:37 rob_ so we can just add something to process_result
2015-01-19 13:07:23 rob_ ah, the timestamp is 'check[:executed]'
2015-01-19 13:07:46 rob_ oh, hmm, that wouldnt work
2015-01-19 13:10:42 rob_ i think we'd have to write the event data - or at least just the interval, to redis
2015-01-19 13:11:00 josnyder yeah
2015-01-19 13:11:10 rob_ then have a determine_stale_checks method
2015-01-19 13:11:28 josnyder well, there would be two possible behaviors when a check is determined to be stale
2015-01-19 13:11:42 josnyder 1) assume this is the correct operation, and that (for instance) Puppet removed the check because it's no longer desired
2015-01-19 13:12:03 josnyder 2) assume this is errant operation, and send a CRITICAL event for that check
2015-01-19 13:12:16 rob_ for 1) you can call the sensu-api to remove the check from redis..
2015-01-19 13:12:32 rob_ like what uchiwa does if you 'X' a check or host
2015-01-19 13:12:42 rob_ i think..
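(Pulling the pieces together, a determine_stale_checks loop along these lines might look like the sketch below: read the stored execution timestamp for each client/check pair out of redis, compare it against the check's interval, and either drop the stale data or publish a critical result. The results collection, the redis client, the 2x-interval threshold, the :remove_on_stale flag, and the two helpers are all assumptions for illustration, not Sensu code.)

    # sketch of a possible determine_stale_checks; helpers and :remove_on_stale are hypothetical
    results.each do |result|
      client, check = result[:client], result[:check]
      executed = redis.get("execution:#{client}:#{check[:name]}").to_i
      interval = check[:interval] || 60
      next unless Time.now.to_i - executed > interval * 2

      if check[:remove_on_stale]
        # behaviour 1: treat the silence as intentional and drop the stored data
        delete_check_data(client, check[:name])
      else
        # behaviour 2: treat the silence as a failure and raise an event
        publish_critical_result(client, check.merge(
          :output => "check result is stale",
          :status => 2
        ))
      end
    end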
2015-01-19 13:12:59 josnyder I didn't know you could 'X' an individual check
2015-01-19 13:13:01 * josnyder looks
2015-01-19 13:13:16 rob_ i think it's called 'resolve'
2015-01-19 13:13:43 josnyder resolving a check doesn't remove it
2015-01-19 13:13:53 rob_ what happens?
2015-01-19 13:13:58 * josnyder checks that assertion
2015-01-19 13:14:10 rob_ i thought it just clears the data from redis for that check
2015-01-19 13:16:15 josnyder rob_: I ran send-test-sensu-alert
2015-01-19 13:16:25 josnyder and then did 'sensu-cli resolve dev4-devb.dev.yelpcorp.com "test_alert_for_josnyder"'
2015-01-19 13:16:39 josnyder the check still shows up in uchiwa
2015-01-19 13:16:48 rob_ josnyder: uchiwa can take a second to update..
2015-01-19 13:17:01 rob_ what does sensu-cli list events say?
2015-01-19 13:17:23 rob_ also redis-cli keys '*test_alert_for_josnyder*'
2015-01-19 13:18:08 josnyder yeah, two keys
2015-01-19 13:18:20 josnyder 1) "execution:dev4-devb.dev.yelpcorp.com:test_alert_for_josnyder"
2015-01-19 13:18:20 josnyder 2) "history:dev4-devb.dev.yelpcorp.com:test_alert_for_josnyder"
2015-01-19 13:18:57 rob_ what values do they contain?
2015-01-19 13:19:01 rob_ maybe it just wipes out the history
2015-01-19 13:20:42 josnyder https://gist.github.com/hashbrowncipher/ccc2a8257c80899282ce
2015-01-19 13:21:23 rob_ ah so it writes a 0 to history, sneaky
2015-01-19 13:21:41 rob_ and possibly updates the execution time :)
2015-01-19 13:24:15 josnyder yeah, it's just this
2015-01-19 13:24:15 josnyder https://github.com/Yelp/sensu/blob/fake_server_source/lib/sensu/api.rb#L207
2015-01-19 13:24:33 josnyder all it does is create an event with status=>0
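(For reference, sensu-cli resolve goes through the Sensu API's resolve endpoint; something like the snippet below, aimed at the API's default port 4567, should be roughly equivalent. The client and check names are the test values from above.)

    require "json"
    require "net/http"

    payload = {
      :client => "dev4-devb.dev.yelpcorp.com",
      :check  => "test_alert_for_josnyder"
    }

    # POST /resolve publishes an OK (status => 0) result for the client/check pair
    uri  = URI("http://localhost:4567/resolve")
    http = Net::HTTP.new(uri.host, uri.port)
    post = Net::HTTP::Post.new(uri.path, "Content-Type" => "application/json")
    post.body = payload.to_json
    http.request(post)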
2015-01-19 13:27:22 rob_ josnyder: so what's next..? could post something to the mailing list and try to get portertech's input and see what he'd accept into mainline..?
2015-01-19 13:28:15 josnyder that seems like a good idea
2015-01-19 13:28:33 josnyder I'll also want to talk to some people at Yelp
2015-01-19 13:29:50 rob_ but you cant do what you want using the sensu-client like how you describe, no
2015-01-19 13:31:23 rob_ josnyder: interestingly, when you stop a sensu agent and the keepalive goes critical, all the rest of the checks for that host stay OK
2015-01-19 13:31:29 josnyder yep
2015-01-19 13:31:37 josnyder they're frozen in time
2015-01-19 13:32:40 rob_ josnyder: i suppose the expected pattern is to actually care about keepalive alerts
2015-01-19 13:32:54 rob_ i would prefer to create explicit dependencies between a check and keepalive, i think
2015-01-19 13:33:05 josnyder I tell people that keepalive alerts mean "your checks on this client aren't being run"
2015-01-19 13:33:25 josnyder and to forget any nonsense about the host being down or whatnot
2015-01-19 13:33:33 rob_ yeah
2015-01-19 13:34:09 rob_ we have a more service oriented architecture where knowing a machine is down doesnt really tell you what services are down
2015-01-19 13:34:32 rob_ (for better or for worse)
2015-01-19 13:34:54 josnyder rob_: many of our services aren't monitored on individual hosts
2015-01-19 13:35:09 josnyder instead, we query the load balancer to ask whether a certain proportion of backends are up
2015-01-19 13:35:10 rob_ do you do aggregate checks or something?
2015-01-19 13:35:15 rob_ ah
2015-01-19 13:35:31 josnyder that's actually a very good example of where I'd like a check timeout
2015-01-19 13:35:44 josnyder because if the check runs and never returns a result, that's a problem
2015-01-19 13:35:53 rob_ yeah