Skip to content

Instantly share code, notes, and snippets.

@sheac
Last active June 29, 2018 22:29
Show Gist options
  • Save sheac/aaf24292beaea14f28bef75d335c927e to your computer and use it in GitHub Desktop.
Save sheac/aaf24292beaea14f28bef75d335c927e to your computer and use it in GitHub Desktop.

Intro

First, I don't know the answer to this yet, so I can't tell you. What I can do is show you how I'd find the answer.

I'll be using line numbers, which will change when you merge master back into your branch. The commit I'm working off of is the current git HEAD of your PR.

Start with the error message

In this case, it's pretty unhelpful: expected: 13 / got: 5 is pretty much useless.

Check the line number on which the error occurs

The stack trace you posted was:

     # ./spec/helpers.rb:919:in `check_job_messages_from_events'
     # ./spec/helpers.rb:1037:in `check_job_from_events'
     # ./spec/helpers.rb:789:in `rescue in assert_job_from_events'
     # ./spec/helpers.rb:783:in `assert_job_from_events'
     # ./spec/helpers.rb:774:in `block in assert_all_jobs_from_events'
     # ./spec/helpers.rb:768:in `each'
     # ./spec/helpers.rb:768:in `assert_all_jobs_from_events'
     # ./spec/job_spec.rb:103:in `block (2 levels) in <top (required)>'

Let's start from the top: spec/helpers.rb:919:

expect(got_messages.size).to eq (expected_messages.size

But what are got_messages and expected_messages? Well they're function parameters. So let's go one level up to spec/helpers.rb:1037 (broken into multiple lines for readability):

check_job_messages_from_events(
    got_state.messages,
    WI.MessagingService(auth_token) { |cli|
        cli.get(job_id, 0, M::Messaging::GetMessageOptions.new({canHandlePendingDocuments: true, limit: -1}))
     }.messages
 )

Looking up a few lines to :1029, we see that got_state is:

WI.JobEventService(auth_token) { |cli| cli.getJobStateFromBeginning(job_id) 

Any time you see WI.SomethingService(auth_token) { |cli| cli.someMethod() } you know you're looking at an API. So what we're looking at is the messages field on the results of two different API calls.

Since these are API calls, we can match the SomethingService part up with a controller in Mothership, and the someMethod part up with a function on the controller. That means MessagingService.get() is MessagingHandler.Get() in controllers/message_controller.go:178. Similarly, JobEventServcie.getJobStateFromBeginning() is JobEventsHandler.GetJobStateFromBeginning() in controllers/job_events_controller.go:49

Let's start with JobEventServcie.getJobStateFromBeginning(). The return value is populated with the job, msgs and snippets variables. Those are gotten from the CreateJobFromEvents() call.

When we look at where the arguments to that function come from, we should consider the most interesting one: tJobEvents. (The context and the job ID are pretty common.) tJobEvents is a transformed version of jobEvents, from the call to GetJobEventsAfter().

So this function is one that grabs all the "events" for a job and then reconstructs the job based on them.

But what's an event? And why do this?

I'm not super clear on this, myself. But my understanding is that for most of our edit/mutate/update operations on jobs, we fire off these things called "events". In theory, the events provide a history of the job that's so complete the job's current state can be recreated based on it.

So that's what this function does: It finds a job's history and tries to build the final event from that.

This helps us understand a data point you provided in your Jira comment:

if I comment the following code then my event reconstruction issue is resolve. Could you please help me on this?

models.RecordCauserJobOp(ctx, handler.Job,
			job_event.JobOp{UpdateDocument: &common_event.UpdateDocumentOp{
				Document: thrifter.ToThriftDocument(doc),
			}})

Let's dig into RecordCauserJobOpWithDetails() (called directly by RecordCauserJobOp()). At the bottom is a function PostFromContext(). It turns out that one calls sink.post(). After some snooping around, you see that this is all connected with our Kafka client, Sarama.

So this means we're pushing events into Kafka when we "record a job op(eration)". In other words, all of the RecordJobOp() family of functions in models/events.go is responsible for producing job events in response to job operations.

So now we're getting a sense of how the Golang/Mothership side of job events works. We can apply the same investigation to MessagingHandler.Get() to figure out what's going on there.

Returning to check_job_messages_from_events()

The failure is happening in check_job_messages_from_events(). We should return there to understand what's going wrong.

Take a look at the top few lines of check_job_from_events(), which calls check_job_messages_from_events():

    expected = WI.JobService(auth_token) { |cli| cli.fetch(select_ops_all_true(M::Job::JobSelectOpts), [job_id]) }[job_id]
    return if expected.get_set_field == :err
    expected = expected.job
    got_state = WI.JobEventService(auth_token) { |cli| cli.getJobStateFromBeginning(job_id) }

Just based on the names of the variables expected and got_state, we see that there's a comparison being set up between the value we expect to see and the value we get. But where are the expectation and the actual coming from?

Let's use our trick of converting WI.SomethingService ... Ruby code to a Golang controller function. WI.JobService(auth_token) { |cli| cli.fetch() ... } == JobHandler.Fetch(). And as we've already seen, WI.JobEventService(auth_token) { |cli| cli.getJobStateFromBeginning() ... } == JobEventServcie.getJobStateFromBeginning().

JobHandler.Fetch() returns a map of JobFetchResult_s whose primary components are JobModular. Further, we know that the JobModular comes from the database directly.

JobEventService.getJobStateFromBeginning() returns a JobState, that contains a JobModular, as well as other things, like an array of Messages and an array of ExecutionSnippets. But unlike with JobHandler.Fetch(), we learned above in our explorations that the components returned from JobEventServcie.getJobStateFromBeginning() in the JobState are re-created from events. We also learned that these events are things that are put into Kafka from Mothership functions that look similar to RecordJobOp(). So we've determined that the got_messages parameter is got_state.messages ==> re-created from events in Kafka.

How about the expected_messages parameter? That's the large block of code we saw above (broken into multiple lines for readability):

check_job_messages_from_events(
    got_state.messages,
    WI.MessagingService(auth_token) { |cli|
        cli.get(job_id, 0, M::Messaging::GetMessageOptions.new({canHandlePendingDocuments: true, limit: -1}))
     }.messages
 )

Recall also that we traced this back to MessagingHandler.Get(). When we look through the code of that function, we learn that the result is fetched from the database. This is in contrast to JobEventService.getJobStateFromBeginning(), which we learned gets its result not from the database, but from events in Kafka.

Now we're in a position to look at the exact line that failed and understand what it means:

expect(got_messages.size).to eq (expected_messages.size)

It means that we compared the messages on a job from two places and they were different. The first (got_messages) is the actual result from rebuilding the list from events. The second (expected_messages) is the cannonical, true, proper list of messages, drawn from the database.

Why don't the numbers match? One possibility is that the logic that reconstructs the list from events is broken. But that's unlikely, because it worked before you made your change, and you didn't change the reconstruction code.

The other possibility is that we're not putting events into Kafka properly: either sending too many or too few, or perhaps the events themselves are badly structured.

Summary

So what we learned is that there's a comparison between two different ways of viewing a job and its data. We treat one way as true. That's when we fetch the data directly from the database. The other way, we want to verify against the first way. That's the result of recreating job state from RecordJobOp()-like functions.

Your job now is to figure out why the events aren't being produced properly. I suggest you do the following:

  1. Look at the job-reconstruction code in Mothership
  2. Determine what kinds of events are used in re-creating the data in question (in this case, messages)
  3. Look for places in Mothership (e.g. controllers, models, lib, services) where we record a job operation using this event type
  4. Attempt to understand why the events are being recorded incorrectly
@pramoddhone
Copy link

I will looking into it and will update on this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment