Tracing frameworks have the following components:
- A client library that is used directly in the application code for instrumentation
- A sidecar application that buffers data from the client library and passes it along
- A dedicated aggregation and reporting server that accumulates and processes data from sidecar applications
From a conceptual standpoint, tracing has the following notions:
- Span - represents a unit of work, e.g. all the processing required by a service to fulfill a single request
- Trace - a collection of spans that represents the end-to-end completion of all the work in the application, from the client all the way through to the deepest service in the system
In order to connect these two notions, we rely on the "propagation" of span and trace data:
- Between services that fulfill requests for one another
- From the services themselves back to the central aggregation server
We want to instrument Beacon with a distributed tracing solution. The two clearest options are Zipkin and Jaeger.
Both Jaeger and Zipkin easily support services that communicate exclusively via HTTP. They use HTTP headers to propagate the span and trace data along with each request. Unfortunately, neither currently has direct support for Kafka-based communication.
What we need to do is build on an existing instrumentation library from one of them and create a way to propagate span and trace data along with our Kafka messages. Depending on the level of support the library provides, we may also have to take on responsibilities it does not already handle.
Zipkin is heavily oriented towards use of HTTP headers for propagation. It even has a special name for its use of headers: "Zipkin B3 Propagation". The closest available instrumentation library is zipkin-go. We can't use the API of this library to extend it to support Kafka-based span data propagation. Instead, we would need to fork and directly modify the code. Doing this would require a detailed understanding of the Zipkin structures and protocols.
While we wouldn't be writing the library from scratch, it's worth noting how much work the Zipkin maintainers themselves consider library-writing to be. Creating a new Zipkin instrumentation library is, in the words of the documentation, "an advanced topic". They consider it enough of an undertaking that they want to hear from anyone attempting it on their Gitter channel.
I'm holding off on an estimate on this for two reasons:
- I believe it would be significantly more costly than Jaeger, so why bother?
- Even developing a plan would require a great deal of research
Jaeger implements the OpenTracing API and is more or less its de facto implementation. We can use that API to "inject" and "extract" span data into and out of Kafka messages.
OpenTracing has a Golang tutorial to explain the concepts and demonstrate an example instrumentation. I asked in the Jaeger Gitter channel whether we would need to write a new client library to support Kafka. The response was a no, and I was pointed in the direction of instrumentation documentation. Based on that documentation, I believe we could use the Jaeger library to instrument Kafka clients by implementing an OpenTracing "Carrier".
As per the documentation:
> The carrier is an abstraction over the underlying RPC framework. For example, a carrier for `TextMap` format is an interface that allows the tracer to write key-value pairs via `Set(key, value)` function, while a carrier for `Binary` format is simply an `io.Writer`.
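For reference, the two carrier-side interfaces that `opentracing-go` defines for the `TextMap` format are small; our Kafka carriers would need to satisfy one each (the reader on the consuming side for `Extract()`, the writer on the producing side for `Inject()`). Comments here are paraphrased from the library:

```go
// TextMapWriter is the carrier interface Tracer.Inject() writes span data into
// when using the opentracing.TextMap format.
type TextMapWriter interface {
	Set(key, val string)
}

// TextMapReader is the carrier interface Tracer.Extract() reads span data out of
// when using the opentracing.TextMap format.
type TextMapReader interface {
	ForeachKey(handler func(key, val string) error) error
}
```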
- Start up a Jaeger server locally using the docker image
- Determine the relevant environment variables for configuring a Jaeger client
  - Refer to the config documentation here
  - For more information, see these examples
- Export all necessary environment variables in the terminal session running Beacon
  - Compare with the values suggested by the docker image documentation
- For every service in Beacon, provide it with a Jaeger tracer on start up, following one of these examples
- Create `KafkaConsumerCarrier` that implements the `opentracing.TextMapReader` interface (see the sketch after this list)
  - Must have a function: `ForeachKey(handler func(key, val string) error) error`
  - Has one field: `ConsumerMessage` of type `sarama.ConsumerMessage`
  - `ForeachKey()` iterates over the `Headers` field of `ConsumerMessage` and calls `handler` on each element
  - Ensure that at the beginning of handling of every message, we extract the tracing data from it by running:
    `spanCtx, err := tracer.Extract(opentracing.TextMap, tracingCarrier)`
    - see the Tracer interface documentation for `Extract()`
    - see this tutorial for more on extracting span contexts
  - Then make sure we create a new `context.Context` with the span's data in it, and pass it along for the duration of the message handling:
    `span := tracer.StartSpan("handle-message-XXX", opentracing.FollowsFrom(spanCtx), ...)`
    `ctx = opentracing.ContextWithSpan(ctx, span)`
  - Finish the span when the handling is complete:
    `defer span.Finish()`
- Create `KafkaProducerCarrier` that implements the `opentracing.TextMapWriter` interface (see the sketch after this list)
  - Must have a function: `Set(key, value)`
  - Has one field: `ProducerMessage` of type `sarama.ProducerMessage`
  - `Set()` sets key-value pairs in the `Headers` field of `ProducerMessage`
  - Ensure that before putting a message into Kafka, we inject tracing data into it by running:
    `err := tracer.Inject(opentracing.SpanFromContext(ctx).Context(), opentracing.TextMap, kafkaProducerCarrier)`
- Add `Log` and `Tag` metadata to traces in accordance with OpenTracing Semantic Conventions best practices
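To make the two carrier items above concrete, here is a minimal sketch of what they could look like with sarama and opentracing-go. The type and function names (`KafkaConsumerCarrier`, `KafkaProducerCarrier`, `handleMessage`, `produceMessage`, `process`) are hypothetical, and the sarama import path may differ from the one Beacon uses; treat this as an illustration of the approach, not a finished implementation.

```go
package tracing

import (
	"context"

	"github.com/Shopify/sarama"
	opentracing "github.com/opentracing/opentracing-go"
)

// KafkaConsumerCarrier exposes the headers of a consumed message to tracer.Extract().
// It satisfies opentracing.TextMapReader.
type KafkaConsumerCarrier struct {
	ConsumerMessage *sarama.ConsumerMessage
}

func (c KafkaConsumerCarrier) ForeachKey(handler func(key, val string) error) error {
	for _, h := range c.ConsumerMessage.Headers {
		if err := handler(string(h.Key), string(h.Value)); err != nil {
			return err
		}
	}
	return nil
}

// KafkaProducerCarrier lets tracer.Inject() write span data into an outgoing message.
// It satisfies opentracing.TextMapWriter.
type KafkaProducerCarrier struct {
	ProducerMessage *sarama.ProducerMessage
}

func (c KafkaProducerCarrier) Set(key, val string) {
	c.ProducerMessage.Headers = append(c.ProducerMessage.Headers, sarama.RecordHeader{
		Key:   []byte(key),
		Value: []byte(val),
	})
}

// handleMessage is the consumer side: extract the incoming span context, start a
// FollowsFrom span, stash it in a fresh context.Context, and finish it when done.
func handleMessage(tracer opentracing.Tracer, msg *sarama.ConsumerMessage) error {
	spanCtx, err := tracer.Extract(opentracing.TextMap, KafkaConsumerCarrier{ConsumerMessage: msg})
	if err != nil {
		// e.g. opentracing.ErrSpanContextNotFound for messages produced without tracing;
		// FollowsFrom simply gets no reference in that case.
		spanCtx = nil
	}
	span := tracer.StartSpan("handle-message", opentracing.FollowsFrom(spanCtx))
	defer span.Finish()

	ctx := opentracing.ContextWithSpan(context.Background(), span)
	return process(ctx, msg)
}

// produceMessage is the producer side: inject the current span (if any) into the
// outgoing message's headers before it is sent.
func produceMessage(ctx context.Context, tracer opentracing.Tracer, msg *sarama.ProducerMessage) error {
	if span := opentracing.SpanFromContext(ctx); span != nil {
		if err := tracer.Inject(span.Context(), opentracing.TextMap, KafkaProducerCarrier{ProducerMessage: msg}); err != nil {
			return err
		}
	}
	// ... hand msg to a sarama producer here ...
	return nil
}

// process stands in for Beacon's actual per-message handling.
func process(ctx context.Context, msg *sarama.ConsumerMessage) error { return nil }
```

Whether an error from `Extract()` should be treated as fatal or just logged is a judgment call; messages produced before tracing is rolled out won't carry any tracing headers.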
I haven't validated the assumptions below by looking at Beacon or playing around with Jaeger. It may take some work to either make them true or work around their falsehoods.
Assumptions:
- Every Beacon sub-service has a distinct process and a distinct `main()` func
  - This is required by OpenTracing's pattern of using a global tracer variable. If multiple Beacon sub-services shared the same Golang process, then they would also share the same tracer.
- From the start of processing a message consumed from Kafka until the time that we've fully dispatched it (including producing subsequent messages for downstream processing), we have a single `ctx` of type `context.Context` available to us
  - i.e. There is a one-to-one mapping between messages being handled and `ctx` instances. For example, we never handle multiple messages in a loop or batch using the same `ctx`
  - i.e. That `ctx` is maintained and passed along from function to function
  - This is required by the need to pass the `span` object along for the entire duration of message processing so it can be utilized by a `KafkaProducerCarrier` any time a new message is produced for downstream processing
- The handling of each message is encapsulated within a single function (see the sketch after this list)
  - i.e. There is a one-to-one mapping between messages being handled and invocations of the handling function. For example, we never handle multiple messages in a loop or batch without starting a new function for each one
  - i.e. That handler function for each message is complete when-and-only-when the message handling has completed
  - This is required by the need to `Finish()` each span when processing is complete. Without proper closure, the span is considered to still be in progress.
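To make the last two assumptions concrete, the consumption side would need to be shaped roughly like the loop below: one handler invocation, one span, and one `context.Context` per message. The names (`consumeLoop`, `handle`) are hypothetical, as is the use of a plain message channel.

```go
package tracing

import (
	"context"

	"github.com/Shopify/sarama"
	opentracing "github.com/opentracing/opentracing-go"
)

// consumeLoop sketches a loop shape that satisfies the assumptions above:
// each message gets its own handler invocation, its own span, and its own ctx.
func consumeLoop(
	tracer opentracing.Tracer,
	messages <-chan *sarama.ConsumerMessage,
	handle func(context.Context, *sarama.ConsumerMessage) error,
) {
	for msg := range messages {
		func() {
			span := tracer.StartSpan("handle-message") // a fresh span per message
			defer span.Finish()                        // finished when-and-only-when handling completes
			ctx := opentracing.ContextWithSpan(context.Background(), span)
			_ = handle(ctx, msg) // the single function that encapsulates handling of this message
		}()
	}
}
```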
- 0.50-1.00 sprints :: Ensure all assumptions are true by making them true (or working around them)
- 0.25-0.25 sprints :: Start up a local Jaeger docker image
- 0.25-0.50 sprints :: Add and configure Jaeger tracers for each sub-service in Beacon
- 0.25-0.50 sprints :: Create a `KafkaConsumerCarrier`
- 0.25-0.50 sprints :: Create a `KafkaProducerCarrier`
- 0.25-0.25 sprints :: Add `Log` and `Tag` metadata to traces in accordance with OpenTracing Semantic Conventions best practices
Total :: 1.75-3.00 sprints
In lesson 2 of the OpenTracing tutorial, Yuri writes:

> Another standard reference type in OpenTracing is `FollowsFrom`, which means the `rootSpan` is the ancestor in the DAG, but it does not depend on the completion of the child span, for example if the child represents a best-effort, fire-and-forget cache write.
So we should use `FollowsFrom` instead of `ChildOf` when creating spans from parent span contexts, because with Kafka the producer of a message does not wait on, or depend on, the downstream consumers that handle it.
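In code the difference is just which reference option we pass to `StartSpan`; a small sketch with a hypothetical helper name:

```go
package tracing

import opentracing "github.com/opentracing/opentracing-go"

// startConsumerSpan (hypothetical helper) starts the consumer-side span from the
// SpanContext extracted out of the message headers. ChildOf would claim the producer's
// span depends on this span completing; FollowsFrom only records causality, which
// matches Kafka's fire-and-forget hand-off.
func startConsumerSpan(tracer opentracing.Tracer, spanCtx opentracing.SpanContext) opentracing.Span {
	// opentracing.ChildOf(spanCtx) would misrepresent the relationship here:
	return tracer.StartSpan("handle-message", opentracing.FollowsFrom(spanCtx))
}
```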