conversation between rob_ and josnyder in #sensu
2015-01-19 12:02:46 josnyder hi rob_, t0m
2015-01-19 12:03:19 josnyder i indeed overlooked that we should be using a sensu agent socket to send messages
2015-01-19 12:05:46 josnyder rob_: I'm specifically thinking of a use case where a job gets scheduled over a chronos cluster and sends results to sensu
2015-01-19 12:06:39 josnyder so any of 100s of machines could run the job, and then want to report the same client/check pair to the sensu aggregator
2015-01-19 12:06:52 rob_ josnyder: hey :)
2015-01-19 12:07:29 rob_ josnyder: sounds like you could just send it to a sensu-agent somewhere then, right?
2015-01-19 12:07:38 josnyder yeah, each of the nodes will have a sensu agent
2015-01-19 12:07:40 rob_ could even have a few sensu agents running and load balance them :)
2015-01-19 12:07:44 josnyder yep
2015-01-19 12:07:58 josnyder but there's no guarantee that the job will be scheduled or run correctly
2015-01-19 12:08:11 josnyder so I'd like to be notified if a check result fails to be sent
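
For reference, the agent socket being discussed is the sensu-client's local input socket, which accepts a JSON check result on TCP port 3030 by default. A minimal sketch of a job reporting through it; the check name and output are made up:

    require 'socket'
    require 'json'

    # Hypothetical check result for a chronos batch job; any node in the
    # cluster could report this same client/check pair.
    result = {
      :name   => 'chronos_batch_job',
      :output => 'job completed in 42s',
      :status => 0  # 0 = OK, 1 = WARNING, 2 = CRITICAL, per the Nagios convention
    }

    # The sensu-client listens for JSON results on localhost:3030 by default.
    TCPSocket.open('localhost', 3030) do |sock|
      sock.write(result.to_json)
    end
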
2015-01-19 12:09:12 rob_ josnyder: ah, and that's why you wanted to introduce per-check keepalives?
2015-01-19 12:09:18 josnyder yep
2015-01-19 12:09:37 rob_ josnyder: did you see how sensu server processes *any* check data as a keepalive, as well as the default keepalive subscription?
2015-01-19 12:09:56 josnyder no...this is interesting to me
2015-01-19 12:10:35 rob_ josnyder: im guessing the actual keepalive subscription is there for when a server has no checks defined
2015-01-19 12:11:35 rob_ josnyder: determine_stale_clients actually loops over all client instances when it checks for machines with failed keepalives, see: https://github.com/sensu/sensu/blob/master/lib/sensu/server.rb#L573
2015-01-19 12:12:22 josnyder rob_: yeah, but there can only be one check on each client with this functionality, right?
2015-01-19 12:13:56 rob_ josnyder: oh wait, i think i misunderstood something..
2015-01-19 12:14:23 rob_ josnyder: the client data is sent with every keepalive, so a keepalive isnt a check, it's an update of the entire client data in redis...
2015-01-19 12:14:38 rob_ (i think)
2015-01-19 12:14:41 * josnyder checks
2015-01-19 12:15:56 josnyder rob_: the only place I see the client:hostname key in redis being set is https://github.com/sensu/sensu/blob/master/lib/sensu/server.rb#L34
2015-01-19 12:18:21 rob_ josnyder: yeah, so that subscription sets up the (re-)writing of the client data to redis on every keepalive
2015-01-19 12:18:41 josnyder ah, yep
2015-01-19 12:18:47 josnyder that block gets called on every keepalive
2015-01-19 12:19:02 rob_ then the timestamp is read from the client data here: https://github.com/sensu/sensu/blob/master/lib/sensu/server.rb#L585
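
The determine_stale_clients logic linked above boils down to comparing each client's last keepalive timestamp against warning/critical thresholds. A rough, self-contained paraphrase; the thresholds and sample data are illustrative, not necessarily Sensu's actual defaults:

    # Illustrative stand-ins for client data as stored in redis; in sensu
    # the :timestamp field is refreshed by every keepalive.
    clients = [
      { :name => 'web01', :timestamp => Time.now.to_i - 30 },
      { :name => 'web02', :timestamp => Time.now.to_i - 200 }
    ]

    clients.each do |client|
      age = Time.now.to_i - client[:timestamp]
      if age >= 180
        puts "#{client[:name]} keepalive CRITICAL: no keepalive for #{age}s"
      elsif age >= 120
        puts "#{client[:name]} keepalive WARNING: no keepalive for #{age}s"
      end
    end
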
2015-01-19 12:19:40 rob_ ok, so lets look at your usecase again
2015-01-19 12:20:33 rob_ i would consider using some kind of proxy between a sensu-agent and your event producer that sends a warn/crit event to sensu if a check isnt received when one is expected
2015-01-19 12:21:01 rob_ that way you can use the (very simple) masquerading that we've already come up with without adding extra complexity to sensu but still get the functionality you want
2015-01-19 12:21:04 josnyder yep, that's one possible implementation
2015-01-19 12:21:38 josnyder I think it's fair to go down that road
2015-01-19 12:23:17 josnyder the thing I find interesting is how much the implementation of such a proxy would match the implementation of determine_stale_clients
2015-01-19 12:23:25 josnyder it would be basically identical
2015-01-19 12:24:06 josnyder that's the only good reason I have for implementing that logic within sensu
2015-01-19 12:24:28 josnyder rob_: you've taught me a bit; thanks
2015-01-19 12:24:56 rob_ josnyder: you could create a sensu-client that only sends a keepalive when it receives a piece of event-data
2015-01-19 12:25:22 josnyder rob_: why don't we spin this discussion out into a new github issue
2015-01-19 12:25:26 rob_ instead of being an agent it'd be more like a proxy since it needn't schedule anything
2015-01-19 12:25:44 josnyder yep, but then you need a new source for every check
2015-01-19 12:26:04 josnyder say I have 400 batch jobs running on my chronos cluster
2015-01-19 12:26:26 josnyder monitoring each one would be 400 proxies (or one proxy that acts like 400 clients)
2015-01-19 12:27:06 rob_ i was picturing each job configured with the same source sending event data through a single proxy
2015-01-19 12:27:21 josnyder yeah, but there's no guarantee that the jobs fail together at once
2015-01-19 12:27:26 rob_ maybe the mental image i have of what you're talking about is wrong :)
2015-01-19 12:27:49 rob_ ah, so you care about each individual batch job?
2015-01-19 12:27:55 josnyder yep
2015-01-19 12:28:19 josnyder if someone makes a code change that causes a given batch job to fail, I want a single check to go critical on the other end
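
To make the proxy idea concrete: a minimal sketch of a watchdog that tracks when each batch job last reported and pushes a critical result for that job's own check through the local client socket once a deadline passes. The job names, deadlines, and report-in path are all assumptions:

    require 'socket'
    require 'json'

    # Hypothetical per-job deadlines (seconds between expected results).
    DEADLINES = { 'nightly_etl' => 86_400, 'hourly_rollup' => 3_600 }

    # In a real proxy this hash would be updated by whatever endpoint the
    # jobs report their completions to; that path is elided here, so the
    # default value is simply the proxy's start time.
    last_seen = Hash.new(Time.now.to_i)

    def send_result(name, status, output)
      # Forward a JSON check result to the local sensu-client socket.
      TCPSocket.open('localhost', 3030) do |sock|
        sock.write({ :name => name, :status => status, :output => output }.to_json)
      end
    end

    loop do
      DEADLINES.each do |job, deadline|
        age = Time.now.to_i - last_seen[job]
        send_result(job, 2, "no result from #{job} in #{age}s") if age > deadline
      end
      sleep 60
    end
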
2015-01-19 12:29:39 josnyder rob_: at this point I think you should be doing what t0m suggested
2015-01-19 12:29:45 josnyder which is to say 'patches welcome'
2015-01-19 12:30:00 josnyder unless, of course, you have a lingering architectural objection
2015-01-19 12:30:03 rob_ ok, i think i get it.. i think the first test i'd do is to try sending event data through the sensu socket and then seeing what happens if you only send one event..
2015-01-19 12:30:40 josnyder we actually have a script that solarkennedy wrote
2015-01-19 12:30:42 rob_ josnyder: i have no objections, i think we're going to be deploying mesos where i work this quarter so this is of interest to me :)
2015-01-19 12:30:44 josnyder that does exactly that
2015-01-19 12:31:05 rob_ josnyder: what happens..?
2015-01-19 12:31:25 josnyder the check shows up in sensu, and...lingers
2015-01-19 12:31:26 josnyder iirc
2015-01-19 12:31:36 josnyder let me check
2015-01-19 12:31:43 rob_ ah, so its state never changes?
2015-01-19 12:32:07 rob_ does the client do anything clever with event data it receives through the socket?
2015-01-19 12:32:18 * rob_ goes to have a look
2015-01-19 12:33:29 rob_ so i think the socket is just a tunnel to rabbitmq..?
2015-01-19 12:35:11 rob_ josnyder: are you talking about his sensu-shell-helper stuff? i've been meaning to take a look at it :)
2015-01-19 12:35:38 josnyder I may be...
2015-01-19 12:35:42 josnyder let me see if it's open source already
2015-01-19 12:35:51 josnyder if not...I can post it as a gist
2015-01-19 12:38:26 josnyder rob_: https://gist.github.com/hashbrowncipher/c044bc60a2c8d577dc50
2015-01-19 12:38:45 josnyder it's really convenient
2015-01-19 12:38:50 rob_ oh cool!
2015-01-19 12:39:00 rob_ https://github.com/sensu/sensu/blob/master/lib/sensu/client.rb#L230
2015-01-19 12:39:17 rob_ this looks like where the client opens a socket and binds it to the rabbitmq transport
2015-01-19 12:39:37 rob_ i cant find anything in the code that looks like it's modifying the event data
2015-01-19 12:39:51 rob_ unless it's in sensu-em
2015-01-19 12:41:01 rob_ ah that's just a fork of eventmachine
2015-01-19 12:41:15 josnyder rob_: so yeah, one of my hosts now has a resolved test_alert_for_josnyder
2015-01-19 12:41:21 josnyder I don't think it will ever go away
2015-01-19 12:41:40 josnyder well...until redis loses track of the key somehow
2015-01-19 12:42:07 rob_ josnyder: ok :) now your idea to add a timeout to each event data makes more sense to me
2015-01-19 12:42:22 rob_ er, timestamp
2015-01-19 12:45:58 * josnyder sends some probing queries to redis
2015-01-19 12:46:31 rob_ i wonder how nagios would handle this
2015-01-19 12:48:37 josnyder i mean, right now we're using nagios with passive checks and passive check timeouts
2015-01-19 12:48:52 rob_ ah, right, of course
2015-01-19 12:48:53 josnyder the feature is pretty directly analogous to how nagios does it
2015-01-19 12:51:20 rob_ hmm
2015-01-19 12:53:52 rob_ ok, so basically there would be a 'determine_stale_checks' loop that would publish a critical event data for a check if whatever threshold is breached
2015-01-19 12:54:00 rob_ maybe it wouldnt be that much effort to implement
2015-01-19 12:54:39 josnyder I hope so
2015-01-19 12:55:07 josnyder rob_: btw. one of our development servers now has a 'test_alert_for_josnyder' displayed on it in the dashboard
2015-01-19 12:55:12 josnyder I don't think it will ever go away
2015-01-19 12:55:16 rob_ ok, at that point i think the original masquerading stuff doesn't change..
2015-01-19 12:55:55 rob_ josnyder: redis-cli rem '*test_alert_for_josnyder*' will probably do the job
2015-01-19 12:56:15 josnyder yep, that would work
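
One nit on that one-liner: redis has no `rem` command, so clearing those keys would actually look more like a pattern scan plus DEL. A sketch using the redis gem; the key pattern follows the execution:/history: naming that comes up later in the conversation:

    require 'redis'

    redis = Redis.new  # localhost:6379 by default

    # SCAN through keys matching the stale check's name and delete each one.
    # (KEYS would also work, but SCAN avoids blocking a busy server.)
    redis.scan_each(:match => '*test_alert_for_josnyder*') do |key|
      redis.del(key)
    end
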
2015-01-19 12:56:38 rob_ josnyder: would be interesting to leave it for 24hr just to see what happens though
2015-01-19 12:56:53 josnyder rob_: looking at our uchiwa, we have plenty of checks that are over 24 hours old
2015-01-19 12:57:38 josnyder I'm seeing green checks from 2015-01-08 and 2015-01-13
2015-01-19 12:57:54 josnyder I think it would go back further if we hadn't cleared redis during our recent sensu upgrade
2015-01-19 12:59:15 rob_ josnyder: ok, so this seems like a legit bug then
2015-01-19 12:59:26 rob_ given the socket functionality exists
2015-01-19 13:01:01 josnyder rob_: what do you think is a bug, specifically?
2015-01-19 13:01:15 josnyder I think the behavior we're looking at is actually a feature, of sorts
2015-01-19 13:02:16 rob_ josnyder: i think it's a problem that if event data for a check isnt updated then it's almost certainly going to be stale data after a period of time
2015-01-19 13:03:10 josnyder yeah
2015-01-19 13:03:47 josnyder I think what's going on is that the sensu server has no clue whether an event is stale, or whether it just runs very infrequently
2015-01-19 13:03:50 rob_ i think it's usually 'ok' since scheduled checks are always going to get their data updated unless the server is down at which point the keepalive event will show the problem
2015-01-19 13:04:00 josnyder yep
2015-01-19 13:04:03 rob_ that's true
2015-01-19 13:04:21 josnyder so if we gave the server some idea of "the check should be updated at least this often"
2015-01-19 13:04:37 rob_ josnyder: i think that already exists, right?
2015-01-19 13:04:42 josnyder it does?
2015-01-19 13:04:58 rob_ well a check definition has interval
2015-01-19 13:05:35 rob_ so if we add a :timestamp to event data and then the timestamp is less than: Time.now - interval
2015-01-19 13:05:54 josnyder timestamp is already there
2015-01-19 13:06:01 josnyder it gets stored as execution:client:check
2015-01-19 13:06:03 josnyder in redis
2015-01-19 13:06:16 rob_ oh, sweet
2015-01-19 13:06:21 josnyder but right now there's no server side logic that makes use of the 'interval' key in check data
2015-01-19 13:06:37 rob_ so we can just add something to process_result
2015-01-19 13:07:23 rob_ ah, the timestamp is 'check[:executed]'
2015-01-19 13:07:46 rob_ oh, hmm, that wouldnt work
2015-01-19 13:10:42 rob_ i think we'd have to write the event data - or at least just the interval, to redis
2015-01-19 13:11:00 josnyder yeah
2015-01-19 13:11:10 rob_ then have a determine_stale_checks method
2015-01-19 13:11:28 josnyder well, there would be two possible behaviors when a check is determined to be stale
2015-01-19 13:11:42 josnyder 1) assume this is the correct operation, and that (for instance) Puppet removed the check because it's no longer desired
2015-01-19 13:12:03 josnyder 2) assume this is errant operation, and send a CRITICAL event for that check
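
A minimal sketch of what the determine_stale_checks loop for option 2 could look like, built on the execution:client:check keys described above. As rob_ notes, the per-check interval would have to come from check data written to redis; here it's a hard-coded assumption, and the 2x grace multiplier is arbitrary:

    require 'redis'

    redis = Redis.new
    interval = 60  # assumed; would be read from the stored check definition

    redis.scan_each(:match => 'execution:*') do |key|
      _, client, check = key.split(':', 3)
      executed = redis.get(key).to_i
      age = Time.now.to_i - executed
      # Allow one missed run before alerting; the multiplier is arbitrary.
      if age > interval * 2
        puts "#{client}/#{check} CRITICAL: no result for #{age}s"
      end
    end
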
2015-01-19 13:12:16 rob_ for 1) you can call the sensu-api to remove the check from redis..
2015-01-19 13:12:32 rob_ like what uchiwa does if you 'X' a check or host
2015-01-19 13:12:42 rob_ i think..
2015-01-19 13:12:59 josnyder I didn't know you could 'X' an individual check
2015-01-19 13:13:01 * josnyder looks
2015-01-19 13:13:16 rob_ i think it's called 'resolve'
2015-01-19 13:13:43 josnyder resolving a check doesn't remove it
2015-01-19 13:13:53 rob_ what happens?
2015-01-19 13:13:58 * josnyder checks that assertion
2015-01-19 13:14:10 rob_ i thought it just clears the data from redis for that check
2015-01-19 13:16:15 josnyder rob_: I ran send-test-sensu-alert
2015-01-19 13:16:25 josnyder and then did 'sensu-cli resolve dev4-devb.dev.yelpcorp.com "test_alert_for_josnyder"'
2015-01-19 13:16:39 josnyder the check still shows up in uchiwa
2015-01-19 13:16:48 rob_ josnyder: uchiwa can take a second to update..
2015-01-19 13:17:01 rob_ what does sensu-cli list events say?
2015-01-19 13:17:23 rob_ also redis-cli keys '*test_alert_for_josnyder*'
2015-01-19 13:18:08 josnyder yeah, two keys
2015-01-19 13:18:20 josnyder 1) "execution:dev4-devb.dev.yelpcorp.com:test_alert_for_josnyder"
2015-01-19 13:18:20 josnyder 2) "history:dev4-devb.dev.yelpcorp.com:test_alert_for_josnyder"
2015-01-19 13:18:57 rob_ what values do they contain?
2015-01-19 13:19:01 rob_ maybe it just wipes out the history
2015-01-19 13:20:42 josnyder https://gist.github.com/hashbrowncipher/ccc2a8257c80899282ce
2015-01-19 13:21:23 rob_ ah so it writes a 0 to history, sneaky
2015-01-19 13:21:41 rob_ and possibly updates the execution time :)
2015-01-19 13:24:15 josnyder yeah, it's just this
2015-01-19 13:24:15 josnyder https://github.com/Yelp/sensu/blob/fake_server_source/lib/sensu/api.rb#L207
2015-01-19 13:24:33 josnyder all it does is create an event with status=>0
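
So 'resolve' feeds a synthetic OK event back through the pipeline rather than deleting anything. For completeness, a sketch of the equivalent call against the Sensu API (default port 4567), which is roughly what sensu-cli and uchiwa go through:

    require 'net/http'
    require 'json'

    # POST /resolve publishes a status 0 result for the client/check pair;
    # it does not delete the execution:/history: keys directly.
    request = Net::HTTP::Post.new('/resolve', 'Content-Type' => 'application/json')
    request.body = {
      :client => 'dev4-devb.dev.yelpcorp.com',
      :check  => 'test_alert_for_josnyder'
    }.to_json

    response = Net::HTTP.new('localhost', 4567).request(request)
    puts response.code  # the API accepts the resolve asynchronously
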
2015-01-19 13:27:22 rob_ josnyder: so what's next..? could post something to the mailing list and try to get portertech's input and see what he'd accept into mainline..?
2015-01-19 13:28:15 josnyder that seems like a good idea
2015-01-19 13:28:33 josnyder I'll also want to talk to some people at Yelp
2015-01-19 13:29:50 rob_ but you cant do what you want using the sensu-client the way you describe, no
2015-01-19 13:31:23 rob_ josnyder: interestingly, when you stop a sensu agent and the keepalive goes critical, all the rest of the checks for that host stay OK
2015-01-19 13:31:29 josnyder yep
2015-01-19 13:31:37 josnyder they're frozen in time
2015-01-19 13:32:40 rob_ josnyder: i suppose the expected pattern is to actually care about keepalive alerts
2015-01-19 13:32:54 rob_ i would prefer to create explicit dependencies between a check and keepalive, i think
2015-01-19 13:33:05 josnyder I tell people that keepalive alerts mean "your checks on this client aren't being run"
2015-01-19 13:33:25 josnyder and to forget any nonsense about the host being down or whatnot
2015-01-19 13:33:33 rob_ yeah
2015-01-19 13:34:09 rob_ we have a more service oriented architecture where knowing a machine is down doesnt really tell you what services are down
2015-01-19 13:34:32 rob_ (for better or for worse)
2015-01-19 13:34:54 josnyder rob_: many of our services aren't monitored on individual hosts
2015-01-19 13:35:09 josnyder instead, we query the load balancer to ask whether a certain proportion of backends are up
2015-01-19 13:35:10 rob_ do you do aggregate checks or something?
2015-01-19 13:35:15 rob_ ah
2015-01-19 13:35:31 josnyder that's actually a very good example of where I'd like a check timeout
2015-01-19 13:35:44 josnyder because if the check runs and never returns a result, that's a problem
2015-01-19 13:35:53 rob_ yeah
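
For illustration, the kind of load-balancer-level check josnyder describes, written as a standalone Sensu/Nagios-style check script: scrape haproxy's stats CSV and alert when too small a proportion of a service's backends are up. The stats URL, backend name, and 50% threshold are all assumptions:

    require 'net/http'

    # Assumed haproxy stats endpoint and backend pool name.
    csv = Net::HTTP.get(URI('http://localhost:8080/haproxy;csv'))
    rows = csv.lines.map { |line| line.split(',') }

    # Keep only the per-server rows for our service; column 0 is the proxy
    # name and column 1 is the server name (or BACKEND/FRONTEND summaries).
    servers = rows.select do |r|
      r[0] == 'my_service' && !%w[BACKEND FRONTEND].include?(r[1])
    end

    up = servers.count { |r| r[17].to_s.start_with?('UP') }  # column 17 is 'status'
    ratio = servers.empty? ? 0.0 : up.to_f / servers.size

    if ratio < 0.5
      puts "CRITICAL: only #{up}/#{servers.size} backends up"
      exit 2
    end
    puts "OK: #{up}/#{servers.size} backends up"
    exit 0
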