@lmolkova
Last active April 3, 2024 07:08
OpenTelemetry Tracing APIs vs .NET Activity/DiagnosticSource

Tracing API Comparison

A distributed trace is a set of events, triggered as a result of a single logical operation, consolidated across various components of an application. A distributed trace contains events that cross process, network and security boundaries. A distributed trace may be initiated when someone presses a button to start an action on a website - in this example, the trace will represent calls made between the downstream services that handled the chain of requests initiated by this button being pressed.

Contract difference

The OpenTelemetry Tracing API is a very strict contract that covers the tracing signal only (not debugging or profiling). The contract is the same for all kinds of libraries and tracing backends and includes several related concepts:

  • span creation, required and optional properties
  • sampling
  • exporting
  • noop behavior in the absence of a tracing implementation
  • extra information that certain types of spans should include (e.g. spans for HTTP calls).

The DiagnosticSource contract is loose. It lets the library and the listener decide together when Activities should be created, which information should be passed, how to sample, how to convert rich payloads into exportable telemetry events, etc. DiagnosticSource can be used for basic tracing or for deeper profiling.

Activity is inspired by OpenTracing (a predecessor of OpenTelemetry) and is very similar to Span.

| OpenTelemetry | .NET Analog | Comments |
|---|---|---|
| Tracer | DiagnosticSource | Somewhat similar. Tracer is used to create spans and carry common components (exporter, sampler, etc). DiagnosticSource notifies the listener about Activities and allows it to control their flow. |
| Span | Activity | Similar. Represents a logical operation |
| SpanContext | Properties on Activity (TraceId, SpanId, etc) | Context that is propagated over the wire |
| SpanBuilder | Set* methods on the Activity | Helper that configures span/activity properties |
| Sampler | - | Configurable algorithm that makes the sampling decision on a span when it is being started |
| Exporter | - | Configurable pipeline to deliver data from the user process to the tracing backend of choice |
| Propagation formats | Activity.Id implements HTTP propagation; users/libs can leverage ActivityTraceId/SpanId for binary propagation | Protocol-specific encoding for context |
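
As a rough illustration of the propagation row, the sketch below starts an Activity from an incoming W3C traceparent value. TraceParentExample, StartFromHeader and the header value are made up for illustration (the value follows the shape used in the W3C trace-context spec); only the System.Diagnostics.Activity members are real, and the sketch assumes .NET Core 3.0 or later.

```csharp
using System.Diagnostics;

public static class TraceParentExample
{
    // Illustrative traceparent value: version-traceId-parentSpanId-flags.
    public const string IncomingTraceParent =
        "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01";

    // Hypothetical helper: starts an Activity whose parent is the remote caller.
    // The caller is responsible for calling Stop() on the returned Activity.
    public static Activity StartFromHeader(string traceParent)
    {
        var activity = new Activity("HttpIn");
        activity.SetParentId(traceParent); // the W3C id format is detected from the parent id
        return activity.Start();
    }
}
```

After StartFromHeader(IncomingTraceParent), the Activity's TraceId and ParentSpanId carry the trace id and span id from the header, and its own SpanId is newly generated.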

Detailed Activity vs Span delta

| Activity | Span | Comments | Importance |
|---|---|---|---|
| OperationName | Name | Different. For Activity it is the DiagnosticSource event prefix (like HttpIn); for OT it is more specific to the operation, like the HTTP path | High, this is used in every instrumentation and listener, Azure SDK uses it |
| Current | tracer.CurrentSpan | | |
| Parent | - | Reference to the parent if it is in-proc | |
| ParentId | - | Unique parent Activity Id (encoded for HTTP propagation) | |
| Id | - | Unique Activity Id encoded for HTTP propagation | |
| RootId | TraceId | Legacy Id that is common for all logical operations in the same trace | |
| TraceStateString | Span.SpanContext.Tracestate | Tracestate as per W3C trace-context spec | |
| Tags | Attributes | Key-value pairs that augment Activity/Span, like the HTTP URL. Difference: Activity supports string keys and string values; OT supports string keys and string, long, double, bool values | Low |
| Baggage | DistributedContext API (not even tracing) | Context that propagates to other services | High. 1P users have high interest in this |
| - | Status | The result of the operation: success, failure, failure kind | High, this is used in every instrumentation and listener, Azure SDK uses it |
| - | Links | Represent a relationship to multiple other span trees (useful in batch processing scenarios) | High. Azure SDK (EventHub, ServiceBus) needs it, Azure services need it for batching scenarios |
| - | Events | Additional events that happen in the scope of a span (receiving a chunk of data, or attaching a log message) | Low |
| - | Kind | Useful for UX: server for incoming requests, client for outgoing, internal for logical operations | High, this is used in every instrumentation and listener, Azure SDK uses it |
| TraceId | Context.TraceId | Same. Trace Id as per W3C trace-context spec | |
| SpanId | Context.SpanId | Same. Span Id as per W3C trace-context spec | |
| ParentSpanId | ParentSpanId | Same. Span Id of the parent Span/Activity | |
| Recorded | IsRecordingEvents | Same. Indicates if the Activity/Span is sampled in or out | |
| ActivityTraceFlags | Context.TraceOptions | Same. Trace flags as per W3C trace-context spec | |
| Duration | Duration | Same | |
| StartTimeUtc | StartTime | Same | |
| APIs to control Id format | - | | |

Key difference in behavior

Notifications

  • OpenTelemetry: When a span ends, it is automatically scheduled for exporting. The library does not need to call anything else.
  • Activity/DiagnosticSource: It is the library's responsibility to accompany each Activity start/stop with a DiagnosticSource event.
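
To make the Activity/DiagnosticSource side concrete, here is a minimal sketch of a listener that receives the start/stop events a library writes. EventNameCollector and NotificationExample are hypothetical names; the source name "test" and the activity name "my activity" are arbitrary. Only the System.Diagnostics types are real.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical listener: records the names of all events written to the
// DiagnosticListener named "test".
public sealed class EventNameCollector :
    IObserver<DiagnosticListener>,
    IObserver<KeyValuePair<string, object>>
{
    public List<string> EventNames { get; } = new List<string>();

    // Called once for every DiagnosticListener in the process.
    public void OnNext(DiagnosticListener listener)
    {
        if (listener.Name == "test")
        {
            listener.Subscribe(this);
        }
    }

    // Called for every event the source writes, including Activity start/stop.
    public void OnNext(KeyValuePair<string, object> evt) => EventNames.Add(evt.Key);

    public void OnCompleted() { }
    public void OnError(Exception error) { }
}

public static class NotificationExample
{
    public static List<string> Run()
    {
        var collector = new EventNameCollector();
        using (DiagnosticListener.AllListeners.Subscribe(collector))
        using (var source = new DiagnosticListener("test"))
        {
            // Library side: each Activity start/stop goes through the source.
            var activity = new Activity("my activity");
            source.StartActivity(activity, null); // writes "my activity.Start"
            source.StopActivity(activity, null);  // writes "my activity.Stop"
        }

        return collector.EventNames;
    }
}
```

If the library called activity.Start() and activity.Stop() directly instead of going through the source, the listener in this sketch would receive nothing.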

Noop vs tracing

  • OpenTelemetry: Library code does not know or care whether the user enabled instrumentation. OpenTelemetry is a noop if the user did not bring the implementation package, and it enables tracing if the user did.
  • Activity/DiagnosticSource: It is the library's responsibility to check whether there is a listener, whether the event is enabled, whether this request is interesting, and so on, and to behave accordingly.

Sampling

  • OpenTelemetry: When a span starts, OpenTelemetry makes the sampling decision. The sampler is configurable by the user or the library. The library may check the span's sampling decision and augment the span with more information only if it is sampled in. OpenTelemetry does some internal optimizations for not-sampled spans (e.g. they are not sent for exporting). Typical sampling algorithms are part of the contract and give a consistent decision for every span in the same trace.
  • Activity/DiagnosticSource: A sampling flag is available on Activity, but there is no other contract (when, who, how, etc). Setting or updating the flag and making sampling decisions is the listener's responsibility.
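
The sampling flag on Activity can be sketched as follows. SamplingFlagExample and StartSampledIn are hypothetical names, and the decision to sample in is hard-coded here; in practice a listener would make it by whatever policy it chooses. The sketch assumes .NET Core 3.0 or later.

```csharp
using System.Diagnostics;

public static class SamplingFlagExample
{
    // Hypothetical helper: starts an Activity that the caller has decided to
    // sample in and returns its W3C Id together with the Recorded flag.
    public static (string Id, bool Recorded) StartSampledIn()
    {
        var activity = new Activity("my activity");
        activity.SetIdFormat(ActivityIdFormat.W3C);                // W3C ids regardless of the process default
        activity.ActivityTraceFlags = ActivityTraceFlags.Recorded; // the sampling decision, made by the caller
        activity.Start();
        try
        {
            return (activity.Id, activity.Recorded);
        }
        finally
        {
            activity.Stop();
        }
    }
}
```

The flag travels as the last segment of the W3C Id, so a downstream service can see the decision, but nothing in the Activity API itself acts on it.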

Implicit vs explicit context propagation

  • OpenTelemetry: implicit propagation is a choice. By default, propagation is explicit.
  • Activity/DiagnosticSource: propagation is always implicit. There is no choice (unless someone wants to hack it).
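
A small sketch of what implicit propagation means for Activity: the child below is never handed its parent, yet it picks it up from Activity.Current, which flows with the async execution context. ImplicitPropagationExample and its method names are hypothetical.

```csharp
using System.Diagnostics;
using System.Threading.Tasks;

public static class ImplicitPropagationExample
{
    public static async Task<string> RunAsync()
    {
        var parent = new Activity("parent").Start();
        try
        {
            await Task.Yield(); // the continuation may run on another thread
            return StartChildAndGetParentName();
        }
        finally
        {
            parent.Stop();
        }
    }

    // Note that no Activity is passed in.
    private static string StartChildAndGetParentName()
    {
        var child = new Activity("child").Start(); // Parent is taken from Activity.Current
        try
        {
            return child.Parent?.OperationName ?? "(none)";
        }
        finally
        {
            child.Stop(); // restores Activity.Current to the parent
        }
    }
}
```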

Augmenting spans

  • OpenTelemetry: The library defines which information to provide and sets attributes on the span as pairs of a string key and a string, long, bool, or double value. It should follow common conventions for well-known things like HTTP, gRPC or DB calls. No rich payloads are involved.
  • Activity/DiagnosticSource: The library gives the listener everything it knows (request/response payloads) and leaves it up to the listener to extract what it needs. The library can also set Tags on the Activity to stamp information that everyone needs and leave it up to the listener to decide whether it needs anything else.
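
On the listener side, extracting data from a rich payload usually comes down to reflection, because payloads are often anonymous objects. The sketch below shows the idea; PayloadReader is a hypothetical helper, and a production listener would likely cache the property lookup rather than repeat it per event.

```csharp
using System.Reflection;

public static class PayloadReader
{
    // Reads a public property from an event payload by name.
    // Returns null when the payload is null or has no such property.
    public static object GetProperty(object payload, string name)
    {
        PropertyInfo property = payload?.GetType().GetProperty(name);
        return property?.GetValue(payload);
    }
}
```

For a payload such as new { Message = message, Result = true }, a listener would call PayloadReader.GetProperty(payload, "Result") and cast the result to bool.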

Suggested areas to focus on

1. [P0] Strict vs loose contract

See the Contract difference section for more details. Is DiagnosticSource a good way to instrument a library? (It seems Activity is.) How strict do we want the contract to be?

2. [P0] OT Span vs .NET Activity

Activity and Span are the same. Can we avoid having OT Spans in the first place?

3. [P1] OT APIs usability and perf

Review OT APIs (at least those that should survive after p1 and p2), influence good design, and leave room for perf improvements now.

4. [P1] Extend Activity APIs

  • decouple Activity.OperationName from DiagnosticSource events
  • add Activity.Status
  • add Links

5. [P1] Activity Baggage vs OT Distributed Context

  • decouple additional context propagation from Activity

6. [P2] Other Activity vs Span API diff

  • Events
  • long, bool, double attributes

7. [P2] Metrics API vs .NET Event Counters

Basic scenario example

Span

Let's assume a library creates this span. The library depends on the OpenTelemetry.Abstractions package.

private static ITracer tracer = Tracing.Tracer;

public static void BasicSpan()
{
    var span = tracer
        .SpanBuilder("my span") // set span name and other properties
        .StartSpan(); 
     
    using (var scope = tracer.WithSpan(span))
    {
        // do stuff
        Console.WriteLine(tracer.CurrentSpan.Context.TraceId);
    }
    
    span.End();
}

Basic Activity with comparable features

private static readonly DiagnosticListener diagnosticListener = new DiagnosticListener("test");

public static void BasicActivityWithEvents()
{
    Activity activity = null;
    if (diagnosticListener.IsEnabled())
    {
        if (diagnosticListener.IsEnabled("my activity"))
        {
            activity = new Activity("my activity");
            diagnosticListener.StartActivity(activity, null);
        }

        // do stuff

        if (activity != null)
        {
            diagnosticListener.StopActivity(activity, null);
        }
    }
    else
    {
        // do stuff
    }
}

Exporting spans

Exporting API is subject to change.

static async Task Main(string[] args)
{
    Tracing.SpanExporter.RegisterHandler("ConsoleExporter", new ConsoleExporter());

    BasicSpan();
}

class ConsoleExporter : IHandler
{
    public Task ExportAsync(IEnumerable<SpanData> spanDataList)
    {
        foreach (var span in spanDataList)
        {
            Console.WriteLine($"[{span.StartTimestamp:o}] Exporting span={span.Name}, duration={span.EndTimestamp - span.StartTimestamp}, status={span.Status} with context traceId={span.Context.TraceId} spanId={span.Context.SpanId} parentId={span.ParentSpanId}");
        }

        return Task.CompletedTask;
    }
}

Typical instrumentation

Let's add some attributes, propagate context over the wire and set status.

Span

public static void TypicalSpan()
{
    var span = tracer
        .SpanBuilder("my span")
        .SetSpanKind(SpanKind.Client)
        .StartSpan();

    using (var _ = tracer.WithSpan(span))
    {
        if (span.IsRecordingEvents)
        {
            span.SetAttribute("component", "example");
            span.SetAttribute("target", "my-service");
        }

        try
        {
            // this is noop check
            if (span.Context.IsValid)
            {  
                tracer.TextFormat.Inject(
                    span.Context,
                    message,
                    (msg, headerName, headerValue) => msg[headerName] = headerValue);
            }
            
            // send message
        }
        catch (Exception e)
        {
            span.Status = Status.Unknown.WithDescription(e.ToString());
            throw;
        }
        finally
        {
            span.End();
        }
    }
}

Activity

public static void TypicalActivity()
{
    var diagnosticListener = new DiagnosticListener("test");

    Activity activity = null;
    if (diagnosticListener.IsEnabled())
    {
        if (diagnosticListener.IsEnabled("my activity"))
        {
            activity = new Activity("my activity");
            activity.AddTag("component", "example");
            activity.AddTag("target", "my-service");
            diagnosticListener.StartActivity(activity, new {Message = message});
        }

        bool result = true;
        try
        {
            if (activity != null)
            {
                message["traceparent"] = activity.Id;
            }
            // send message
        }
        catch (Exception e)
        {
            result = false;
            diagnosticListener.Write("exception", new {Exception = e, Message = message});
            throw;
        }
        finally
        {
            if (activity != null)
            {
                diagnosticListener.StopActivity(activity, new {Message = message, Result = result});
            }
        }
    }
    else
    {
        // send message (no listener is subscribed, so nothing is instrumented)
    }
}

Links

static async Task ReadAndProcessAsync(PartitionReceiver eventHubReceiver)
{
    var tracer = Tracing.Tracer;

    while (true)
    {
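        IEnumerable<EventData> messages = await eventHubReceiver.ReceiveAsync(5);
        if (messages == null)
        {
            // ReceiveAsync returns null when no events are available
            continue;
        }

        var builder = tracer.SpanBuilder("process message");
        foreach (EventData message in messages)
        {
            // ExtractActivity is assumed to be a helper that reads the trace context stored on the event
            builder.AddLink(message.ExtractActivity());
        }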

        var span = builder.StartSpan();
        foreach (EventData message in messages)
        {
            // process messages
        }
        span.End();
    }
}

Metrics API

OpenTelemetry allows recording either raw measurements or metrics with a predefined aggregation and set of labels.

Recording raw measurements with the OpenTelemetry API defers to the end user the decision on which aggregation algorithm to apply to a metric and which labels (dimensions) to define. Client libraries like gRPC will use it to record raw measurements such as "server_latency" or "received_bytes". The end user then decides what type of aggregated values to collect from these raw measurements, from a simple average to an elaborate histogram calculation.

Recording metrics with a predefined aggregation using the OpenTelemetry API is no less important. It allows collecting values like CPU and memory usage, or simple metrics like "queue length".

Raw metrics collection is similar to .NET PerformanceCounters or EventCounter.
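
For comparison, this is roughly what recording a raw measurement looks like with a .NET EventCounter. The event source name "Example-Metrics", the class ExampleMetricsEventSource and the method RecordServerLatency are made up for illustration; "server_latency" is borrowed from the gRPC example above.

```csharp
using System.Diagnostics.Tracing;

// Hypothetical event source that exposes one counter.
[EventSource(Name = "Example-Metrics")]
public sealed class ExampleMetricsEventSource : EventSource
{
    public static readonly ExampleMetricsEventSource Log = new ExampleMetricsEventSource();

    private readonly EventCounter _serverLatency;

    private ExampleMetricsEventSource()
    {
        // The counter aggregates raw values (mean, min, max, count) per reporting interval.
        _serverLatency = new EventCounter("server_latency", this);
    }

    // Records one raw measurement. It is only published when a listener enables the counter.
    [NonEvent]
    public void RecordServerLatency(float milliseconds) => _serverLatency.WriteMetric(milliseconds);

    protected override void Dispose(bool disposing)
    {
        if (disposing)
        {
            _serverLatency.Dispose();
        }

        base.Dispose(disposing);
    }
}
```

A library would call ExampleMetricsEventSource.Log.RecordServerLatency(12.5f) after each request. Unlike the OpenTelemetry raw measurement model described above, the aggregation here is fixed by EventCounter rather than chosen by the end user.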

Distributed Context API

Labels other telemetry (metrics, traces) with user-defined context that flows across process boundaries.

This is similar to Activity.Baggage but is not tied to tracing, i.e. it could be used with metrics only. Another analog is ILogger scopes, but distributed.
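
For reference, a minimal sketch of how Activity.Baggage behaves today: items added to a parent Activity are visible from its children. BaggageExample is a hypothetical name, and the key "tenant" and value "contoso" are arbitrary.

```csharp
using System.Diagnostics;

public static class BaggageExample
{
    public static string ReadBaggageFromChild()
    {
        var parent = new Activity("parent")
            .AddBaggage("tenant", "contoso")
            .Start();
        var child = new Activity("child").Start(); // parent is picked up from Activity.Current
        try
        {
            // Baggage lookup walks up the parent chain.
            return child.GetBaggageItem("tenant");
        }
        finally
        {
            child.Stop();
            parent.Stop();
        }
    }
}
```

The baggage here cannot exist without an Activity, which is the coupling that a separate Distributed Context API avoids.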

Resources API

Resource captures information about the entity for which telemetry is recorded. For example, metrics exposed by a Kubernetes container can be linked to a resource that specifies the cluster, namespace, pod, and container name.

Resource may capture an entire hierarchy of entity identification. It may describe the host in the cloud and specific container or an application running in the process.

Logging API

Future. Probably does not make much sense in .NET.

@vanwx
vanwx commented Jul 17, 2023

Thank you ❤️