Tracing frameworks have the following components:
- A client library that is used directly in the application code for instrumentation
- A sidecar application that buffers data from the client library and passes it along
- A dedicated aggregation and reporting server that accumulates and processes data from sidecar applications
From a conceptual standpoint, tracing has the following notions:
- Span - represents a unit of work, e.g. all the processing required by a service to fulfill a single request
- Trace - a collection of spans that represents the end-to-end completion of all the work in the application, from the client all the way through to the deepest service in the system
In order to connect these two notions, we rely on the "propagation" of span and trace data:
- Between services that fulfill requests for one another
- From the services themselves back to the central aggregation server
We want to instrument Beacon with a distributed tracing solution. The two clearest options are Zipkin and Jaeger.
Both Jaeger and Zipkin easily support services that communicate exclusively via HTTP. They use HTTP headers to propagate the span and trace data along with each request. Unfortunately, neither currently has direct support for Kafka-based communication.
What we need to do is build on an existing instrumentation library from one of them and create a way to propagate span and trace data along with our Kafka messages. Depending on the level of support the library provides, we may also have to take on responsibilities it does not already handle.
Zipkin is heavily oriented towards use of HTTP headers for propagation. It even has a special name for its use of headers: "Zipkin B3 Propagation". The closest available instrumentation library is zipkin-go. We can't use the API of this library to extend it to support Kafka-based span data propagation. Instead, we would need to fork and directly modify the code. Doing this would require a detailed understanding of the Zipkin structures and protocols.
While we wouldn't be writing the library from scratch, it's worth noting how much work the Zipkin maintainers themselves consider library-writing to be. Creating a new Zipkin instrumentation library is, in the words of the documentation, "an advanced topic". They consider it enough of an undertaking that they want to hear from anyone attempting it on their Gitter channel.
I'm holding off on an estimate on this for two reasons:
- I believe it would be significantly more costly than Jaeger, so why bother?
- Even developing a plan would require a great deal of research
Jaeger implements the OpenTracing API and is more or less its de facto implementation. We can use that API to "inject" and "extract" span data into and out of Kafka messages.
OpenTracing has a Golang tutorial to explain the concepts and demonstrate an example instrumentation. I asked in the Jaeger Gitter channel whether we would need to write a new client library to support Kafka. The response was a no, and I was pointed in the direction of instrumentation documentation. Based on that documentation, I believe we could use the Jaeger library to instrument Kafka clients by implementing an OpenTracing "Carrier".
As per the documentation:
> The carrier is an abstraction over the underlying RPC framework. For example, a carrier for `TextMap` format is an interface that allows the tracer to write key-value pairs via `Set(key, value)` function, while a carrier for `Binary` format is simply an `io.Writer`.
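For reference, the two carrier-side interfaces that `opentracing-go` defines for the `TextMap` format are small; our Kafka carriers would need to satisfy one each (the reader on the consuming side for `Extract()`, the writer on the producing side for `Inject()`). Comments here are paraphrased from the library:

```go
// TextMapWriter is the carrier interface Tracer.Inject() writes span data into
// when using the opentracing.TextMap format.
type TextMapWriter interface {
	Set(key, val string)
}

// TextMapReader is the carrier interface Tracer.Extract() reads span data out of
// when using the opentracing.TextMap format.
type TextMapReader interface {
	ForeachKey(handler func(key, val string) error) error
}
```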
- Start up a Jaeger server locally using the docker image
- Determine the relevant environment variables for configuring a Jaeger client
  - Refer to the config documentation here
  - For more information, see these examples
- Export all necessary environment variables in the terminal session running Beacon
  - Compare with the values suggested by the docker image documentation
- For every service in Beacon, provide it with a Jaeger tracer on start up, following one of these examples
- Create `KafkaConsumerCarrier` that implements the `opentracing.TextMapReader` interface (see the sketch after this list)
  - Must have a function: `ForeachKey(handler func(key, val string) error) error`
  - Has one field: `ConsumerMessage` of type `sarama.ConsumerMessage`
  - `ForeachKey()` iterates over the `Headers` field of `ConsumerMessage` and calls `handler` on each element
  - Ensure that at the beginning of handling of every message, we extract the tracing data from it by running:
    `spanCtx, err := tracer.Extract(opentracing.TextMap, tracingCarrier)`
    - see the Tracer interface documentation for `Extract()`
    - see this tutorial for more on extracting span contexts
  - Then make sure we create a new `context.Context` with the span's data in it, and pass it along for the duration of the message handling:
    `span := tracer.StartSpan("handle-message-XXX", opentracing.FollowsFrom(spanCtx), ...)`
    `ctx = opentracing.ContextWithSpan(ctx, span)`
  - Finish the span when the handling is complete:
    `defer span.Finish()`
- Create `KafkaProducerCarrier` that implements the `opentracing.TextMapWriter` interface (see the sketch after this list)
  - Must have a function: `Set(key, value)`
  - Has one field: `ProducerMessage` of type `sarama.ProducerMessage`
  - `Set()` sets key-value pairs in the `Headers` field of `ProducerMessage`
  - Ensure that before putting a message into Kafka, we inject tracing data into it by running:
    `err := tracer.Inject(opentracing.SpanFromContext(ctx).Context(), opentracing.TextMap, kafkaProducerCarrier)`
- Add `Log` and `Tag` metadata to traces in accordance with OpenTracing Semantic Conventions best practices
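To make the two carrier items above concrete, here is a minimal sketch of what they could look like with sarama and opentracing-go. The type and function names (`KafkaConsumerCarrier`, `KafkaProducerCarrier`, `handleMessage`, `produceMessage`, `process`) are hypothetical, and the sarama import path may differ from the one Beacon uses; treat this as an illustration of the approach, not a finished implementation.

```go
package tracing

import (
	"context"

	"github.com/Shopify/sarama"
	opentracing "github.com/opentracing/opentracing-go"
)

// KafkaConsumerCarrier exposes the headers of a consumed message to tracer.Extract().
// It satisfies opentracing.TextMapReader.
type KafkaConsumerCarrier struct {
	ConsumerMessage *sarama.ConsumerMessage
}

func (c KafkaConsumerCarrier) ForeachKey(handler func(key, val string) error) error {
	for _, h := range c.ConsumerMessage.Headers {
		if err := handler(string(h.Key), string(h.Value)); err != nil {
			return err
		}
	}
	return nil
}

// KafkaProducerCarrier lets tracer.Inject() write span data into an outgoing message.
// It satisfies opentracing.TextMapWriter.
type KafkaProducerCarrier struct {
	ProducerMessage *sarama.ProducerMessage
}

func (c KafkaProducerCarrier) Set(key, val string) {
	c.ProducerMessage.Headers = append(c.ProducerMessage.Headers, sarama.RecordHeader{
		Key:   []byte(key),
		Value: []byte(val),
	})
}

// handleMessage is the consumer side: extract the incoming span context, start a
// FollowsFrom span, stash it in a fresh context.Context, and finish it when done.
func handleMessage(tracer opentracing.Tracer, msg *sarama.ConsumerMessage) error {
	spanCtx, err := tracer.Extract(opentracing.TextMap, KafkaConsumerCarrier{ConsumerMessage: msg})
	if err != nil {
		// e.g. opentracing.ErrSpanContextNotFound for messages produced without tracing;
		// FollowsFrom simply gets no reference in that case.
		spanCtx = nil
	}
	span := tracer.StartSpan("handle-message", opentracing.FollowsFrom(spanCtx))
	defer span.Finish()

	ctx := opentracing.ContextWithSpan(context.Background(), span)
	return process(ctx, msg)
}

// produceMessage is the producer side: inject the current span (if any) into the
// outgoing message's headers before it is sent.
func produceMessage(ctx context.Context, tracer opentracing.Tracer, msg *sarama.ProducerMessage) error {
	if span := opentracing.SpanFromContext(ctx); span != nil {
		if err := tracer.Inject(span.Context(), opentracing.TextMap, KafkaProducerCarrier{ProducerMessage: msg}); err != nil {
			return err
		}
	}
	// ... hand msg to a sarama producer here ...
	return nil
}

// process stands in for Beacon's actual per-message handling.
func process(ctx context.Context, msg *sarama.ConsumerMessage) error { return nil }
```

Whether an error from `Extract()` should be treated as fatal or just logged is a judgment call; messages produced before tracing is rolled out won't carry any tracing headers.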
I haven't validated the assumptions below by looking at Beacon or playing around with Jaeger. It may take some work to either make them true or work around their falsehoods.
Assumptions:
- Every Beacon sub-service has a distinct process and a distinct `main()` func
  - This is required by OpenTracing's pattern of using a global tracer variable. If multiple Beacon sub-services shared the same Golang process, then they would also share the same tracer.
- From the start of processing a message consumed from Kafka until the time that we've fully dispatched it (including producing subsequent messages for downstream processing), we have a single `ctx` of type `context.Context` available to us
  - i.e. There is a one-to-one mapping between messages being handled and `ctx` instances. For example, we never handle multiple messages in a loop or batch using the same `ctx`
  - i.e. That `ctx` is maintained and passed along from function to function
  - This is required by the need to pass the `span` object along for the entire duration of message processing so it can be utilized by a `KafkaProducerCarrier` any time a new message is produced for downstream processing
- The handling of each message is encapsulated within a single function (see the sketch after this list)
  - i.e. There is a one-to-one mapping between messages being handled and invocations of the handling function. For example, we never handle multiple messages in a loop or batch without starting a new function for each one
  - i.e. That handler function for each message is complete when-and-only-when the message handling has completed
  - This is required by the need to `Finish()` each span when processing is complete. Without proper closure, the span is considered to still be in progress.
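To make the last two assumptions concrete, the consumption side would need to be shaped roughly like the loop below: one handler invocation, one span, and one `context.Context` per message. The names (`consumeLoop`, `handle`) are hypothetical, as is the use of a plain message channel.

```go
package tracing

import (
	"context"

	"github.com/Shopify/sarama"
	opentracing "github.com/opentracing/opentracing-go"
)

// consumeLoop sketches a loop shape that satisfies the assumptions above:
// each message gets its own handler invocation, its own span, and its own ctx.
func consumeLoop(
	tracer opentracing.Tracer,
	messages <-chan *sarama.ConsumerMessage,
	handle func(context.Context, *sarama.ConsumerMessage) error,
) {
	for msg := range messages {
		func() {
			span := tracer.StartSpan("handle-message") // a fresh span per message
			defer span.Finish()                        // finished when-and-only-when handling completes
			ctx := opentracing.ContextWithSpan(context.Background(), span)
			_ = handle(ctx, msg) // the single function that encapsulates handling of this message
		}()
	}
}
```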
- 0.50-1.00 sprints :: Ensure all assumptions are true by making them true (or working around them)
- 0.25-0.25 sprints :: Start up a local Jaeger docker image
- 0.25-0.50 sprints :: Add and configure Jaeger tracers for each sub-service in Beacon
- 0.25-0.50 sprints :: Create a `KafkaConsumerCarrier`
- 0.25-0.50 sprints :: Create a `KafkaProducerCarrier`
- 0.25-0.25 sprints :: Add `Log` and `Tag` metadata to traces in accordance with OpenTracing Semantic Conventions best practices
Total :: 1.75-3.00 sprints
In lesson 2 of the OpenTracing tutorial, Yuri writes:

> Another standard reference type in OpenTracing is `FollowsFrom`, which means the `rootSpan` is the ancestor in the DAG, but it does not depend on the completion of the child span, for example if the child represents a best-effort, fire-and-forget cache write.
So we should use `FollowsFrom` instead of `ChildOf` when creating spans from parent span contexts, because with Kafka the producer of a message does not wait on, or depend on, the downstream consumers that handle it.
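In code the difference is just which reference option we pass to `StartSpan`; a small sketch with a hypothetical helper name:

```go
package tracing

import opentracing "github.com/opentracing/opentracing-go"

// startConsumerSpan (hypothetical helper) starts the consumer-side span from the
// SpanContext extracted out of the message headers. ChildOf would claim the producer's
// span depends on this span completing; FollowsFrom only records causality, which
// matches Kafka's fire-and-forget hand-off.
func startConsumerSpan(tracer opentracing.Tracer, spanCtx opentracing.SpanContext) opentracing.Span {
	// opentracing.ChildOf(spanCtx) would misrepresent the relationship here:
	return tracer.StartSpan("handle-message", opentracing.FollowsFrom(spanCtx))
}
```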