@lmolkova
Last active February 22, 2024
Metrics in Azure SDK for Java


The immediate goal is to report metrics from Azure messaging SDKs (EventHubs and ServiceBus) to help customers detect and investigate configuration issues, performance bottlenecks, application and SDK bugs.

It can be broken down into smaller goals:

  • define metrics essential for messaging scenarios
  • define Metrics API in azure-core
  • metrics plugin implementations

Scenarios

User scenarios

We expect users to want to know how many messages were received, processed, and checkpointed; the delay before consumers receive messages; batch sizes; the success rate of network operations; and other key metrics we're going to define. Some of these metrics can be calculated from traces, but not all of them, so we're going to focus on the latter. Metrics provide a more performant, cheaper, and more production-ready solution than tracing.

We expect users to already have some metrics solution in their app. Based on a Spring One survey, 90%+ of attendees use an APM tool (for logs, metrics, or traces); of those, ~20% use Prometheus and ~30% use Azure Monitor.

SDK scenarios

  • Supportability: our TSGs should include steps that ask users to check metrics emitted by the SDK instead of verbose logs. This would help narrow down problems without reconfiguring logging and reproducing the issue.
  • Stress tests: assuming SDKs report metrics, stress tests would be just a regular user of this feature. If we see an issue in a stress test run, we can use the built-in metrics to investigate it the same way users would.

Usage beyond messaging SDKs

  • HTTP-based SDKs: limited scope and can be done in core; can be hooked in automatically in tracing calls before sampling.
  • thick clients: CosmosDB (already uses Micrometer in Java, has similar ask for .NET)

Beyond Java

  • .NET: OTel Metrics are included in DiagnosticSource 6.0.
  • Python: OTel metrics API are in RC
  • JS: OTel metrics are in development
  • Go: Alpha
  • C++: Alpha

The proposal here is to polish the scenario in Java, where we have a partner ask, and learn from it before doing any work in other languages.

Metric solution

We're going to pick the solution that

  • works for Spring Cloud
  • works with Azure Monitor and Prometheus, and
    • is compatible with a variety of other APM vendors
    • has a fair number of existing instrumentations

Summary

  • we'll have Meter API abstractions in azure-core
  • Provide OTel-based implementation for Meter APIs
  • Spring will keep using Micrometer, and we will provide a Micrometer-based implementation similar to this sample

OTel vs Micrometer analysis

Metrics API

Closely follow the OTel metrics API; the Micrometer APIs are quite similar. Implement only a subset, since performance is more important than convenience.

Naming choice: Azure prefix is added to avoid collision with OTel Meter.

Example of usage in Client libraries

// Attributes with the possible error status can be created upfront, usually along with the client instance.
Map<String, Object> successAttributes = createAttributes("http://service-endpoint.azure.com", false);
Map<String, Object> errorAttributes = createAttributes("http://service-endpoint.azure.com", true);

// Create instruments for possible error codes. This can also be done lazily, once a specific error code is received.
AzureLongCounter successfulHttpConnections = defaultMeter.createLongCounter("az.core.http.connections",
    "Number of created HTTP connections", null, successAttributes);

AzureLongCounter failedHttpConnections = defaultMeter.createLongCounter("az.core.http.connections",
    "Number of created HTTP connections", null, errorAttributes);

boolean success = false;
try {
    success = connect();
} finally {
    if (success) {
        successfulHttpConnections.add(1, currentContext);
    } else {
        failedHttpConnections.add(1, currentContext);
    }
}
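For context, the snippet above relies on azure-core helper types (`AzureLongCounter`, `defaultMeter`, `createAttributes`) that are not shown. Here is a minimal self-contained sketch of what such an abstraction could look like, with an in-memory meter standing in for a real OTel-backed implementation; all names here are illustrative, not the shipped API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class MeterSketch {
    /** A counter pre-bound to fixed attributes, mirroring the proposed AzureLongCounter. */
    interface LongCounter {
        void add(long value);
        long current();
    }

    /** A minimal meter: hands out counters keyed by instrument name + attributes. */
    static class InMemoryMeter {
        private final Map<String, AtomicLong> counters = new ConcurrentHashMap<>();

        LongCounter createLongCounter(String name, String description, Map<String, Object> attributes) {
            AtomicLong value = counters.computeIfAbsent(name + attributes, k -> new AtomicLong());
            return new LongCounter() {
                public void add(long delta) { value.addAndGet(delta); }
                public long current() { return value.get(); }
            };
        }
    }

    public static void main(String[] args) {
        InMemoryMeter meter = new InMemoryMeter();
        Map<String, Object> success = Map.of("endpoint", "http://service-endpoint.azure.com", "error", false);
        Map<String, Object> error = Map.of("endpoint", "http://service-endpoint.azure.com", "error", true);

        // Same instrument name, different attribute sets -> two bound counters,
        // matching the success/failure pattern in the example above.
        LongCounter ok = meter.createLongCounter("az.core.http.connections",
                "Number of created HTTP connections", success);
        LongCounter failed = meter.createLongCounter("az.core.http.connections",
                "Number of created HTTP connections", failed -> but use error map below
    }
}
```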

User apps

Basic

// configure OpenTelemetry SDK as usual and register global configuration
SdkMeterProvider meterProvider = SdkMeterProvider.builder()
    .registerMetricReader(PeriodicMetricReader.builder(OtlpGrpcMetricExporter.builder().build()).build())
    .build();

OpenTelemetrySdk.builder()
    .setMeterProvider(meterProvider)
    .setPropagators(ContextPropagators.create(W3CTraceContextPropagator.getInstance()))
    .buildAndRegisterGlobal();

// configure Azure Client, no metric configuration needed, client will use global OTel configuration
AzureClient sampleClient = new AzureClientBuilder()
    .endpoint("https://my-client.azure.com")
    .build();

// use client as usual; if it emits metrics, they will be exported
sampleClient.methodCall("get items", Context.NONE);

Custom configuration, along with tracing

// configure OpenTelemetry SDK as usual
SdkMeterProvider meterProvider = SdkMeterProvider.builder()
    .registerMetricReader(PeriodicMetricReader.builder(OtlpGrpcMetricExporter.builder().build()).build())
    .build();

OpenTelemetry openTelemetry = OpenTelemetrySdk.builder()
    .setMeterProvider(meterProvider)
    .setPropagators(ContextPropagators.create(W3CTraceContextPropagator.getInstance()))
    .build();

// Pass OTel meterProvider to MetricsOptions - it will be used instead of the implicit global singleton.
MetricsOptions customMetricsOptions = new MetricsOptions()
    .setProvider(meterProvider);

// configure Azure Client and pass the custom metrics options through client options
// (exact plumbing may vary; shown here via a ClientOptions setter)
AzureClient sampleClient = new AzureClientBuilder()
    .endpoint("https://my-client.azure.com")
    .clientOptions(new ClientOptions().setMetricsOptions(customMetricsOptions))
    .build();

Span span = openTelemetry.getTracer("azure-core-samples")
    .spanBuilder("doWork")
    .startSpan();

try (Scope scope = span.makeCurrent()) {
    // do some work

    // Current context flows to OpenTelemetry metrics and is used to populate exemplars
    String response = sampleClient.methodCall("get items");
    // do more work
}

span.end();

Messaging Metrics

Prior art

  • Current EventHubs broker metrics
    • Requests: incoming/outgoing, success, throttles
    • Messages: incoming, outgoing, captured
    • Bytes: incoming, outgoing, size of EH
  • Track -1 ServiceBus performance counters
    • SendMessage/ReceiveMessage/CompleteMessage/AcceptSession/CancelScheduled
      • count (error/success)
      • rate (error/success)
      • duration
      • per namespace and per entity
    • Exceptions: count and rate (by type)
    • TokenAcquisition: rate (success/error), latency
    • Pending ReceiveMessage/AcceptMessageSession/AcceptMessageSessionByNamespace: count
    • EventProcessor process: latency, batch size
    • Connections: reset count (per entity), redirect count
    • Prefetch queue size and depth(?) per entity
    • Throughput (in/out): byte rate (per ns/entity)
  • XBox EventHubs Perf Counters - internal
    • Blob offset store: time since last offset flush
    • Producer (per topic):
      • Latency
      • Throughput
      • Request rate, retry rate, timeout rate, error receive rate
      • Transmission error rate
      • Event rate per partition
    • Buffered producer
      • Queue size
      • Queue full rate
      • Enqueue rate
      • Batch size
      • Event Time in queue
    • Consumer (per topic, per partition)
      • Lag (derived from two other metrics): last received (seqNo) - last published (seqNo)
      • Receive rate (success/error)
      • Producer-to-consumer latency (receive timestamp - enqueued time), approximate
      • Seconds to zero: lag * consumption rate
      • Consumption rate
      • Consumer queue size
      • Delivery queue: size, incoming rate, delivery rate, delivery failure rate
  • Current Kafka metrics
    • Producer:
      • batch size, splits
      • throughput: outgoing bytes, compression rate
      • metadata age
      • throttle time
      • record: errors, rate, time in send buffer, retries, size
      • request: rate, size, active
      • response rate, bytes
    • Connections: close, creation, io stats
    • Consumer:
      • fetch: latency, rate, size, throttle time, counts
      • records: rate, lag, batch size
      • bytes consumed
      • consumer groups: partitions, commit latency, rate, join rate, etc
  • Kafka proposal
    • intent: Kafka client library internals observability
    • metrics:
      • connections creations/active, errors
      • requests: rate, rtt, errors
      • internal queue latency, size
      • client io wait time
      • producer queue size, bytes
      • consumer
        • poll interval, latency, last time
        • consumer queue count, bytes
        • consumer group: errors, rebalance, partitions counts
  • DataDog article on Key RabbitMq metrics
    • broker side, mostly irrelevant
  • DataDog article on Kafka metrics
    • producer: response/response rate, latency, io wait time, batch size, throughput produced and batch-compression rate
    • consumer: record lag records rate, fetch rate, throughput consumed
  • OTel proposal, early WIP

EventHubs Metrics Proposal

Report metrics that are useful for customers when operating applications with EventHubs or ServiceBus. We can add more to expose internals later.

All

| Metric | Type | Comment |
|--------|------|---------|
| Last offset on broker | counter | [TODO] Opt-in; offset of the last message published successfully |
| Last sequence number on broker | counter | [TODO] Opt-in; sequence number of the last message published successfully |
| AMQP link: errors | counter | Link errors by error code |
| AMQP session: errors | counter | Session errors by error code |
| AMQP connections: active | up-down-counter | Number of active connections; available on broker, not per client process |
| AMQP connections: creations | counter | Number of created connections; available on broker, not per client |

Dimensions:

  • Namespace
  • Entity
  • EntityPath

The broker-side offset and sequence number metrics can only be reported as opt-in metrics (additional charges apply); customers would be expected to opt in on either the producer or the consumer.

Producer

| Metric | Type | Comment |
|--------|------|---------|
| Send: duration | histogram | Duration in milliseconds of a ProducerClient.Send call, including all retries |
| Send: messages in batch | counter | Number of messages sent per ProducerClient.Send call |
| Send: bytes in batch | histogram | Number of bytes sent per ProducerClient.Send call |
| AMQP link: send duration | histogram | Response time (in milliseconds) of an AMQP send request |

Dimensions:

  • Namespace
  • Entity
  • EntityPath
  • Error code (or success)

Notes:

  • We can calculate attempts metrics, e.g. avg attempts = count(link send duration) / count(send duration). If this proves insufficient, we can come up with a better one.
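A toy sketch of that derivation, with assumed counts rather than real telemetry:

```java
// Toy sketch: deriving average attempts per send from the two proposed metrics.
public class AttemptsSketch {
    // avg attempts = count(AMQP link send duration) / count(Send call duration)
    static double avgAttempts(long linkSendCount, long sendCallCount) {
        return (double) linkSendCount / sendCallCount;
    }

    public static void main(String[] args) {
        // e.g. 130 AMQP link-level sends observed across 100 ProducerClient.Send calls
        System.out.println(avgAttempts(130, 100)); // prints 1.3
    }
}
```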

Consumer

| Metric | Type | Comment |
|--------|------|---------|
| AMQP: messages received | counter | Number of messages received per Consumer.Receive call |
| AMQP: credits requested | counter | Number of credits requested from the broker |
| Processor: duration | histogram | available on broker, not per client |
| Processor: error handler | counter | Error handler invocations |
| Checkpoint: duration | histogram | available on broker, not per client |
| Checkpoint: last offset checkpointed | counter | |
| Checkpoint: last sequence number checkpointed | counter | |

Dimensions:

  • Namespace
  • Entity
  • Error code (or success)
  • EntityPath
  • Consumer GroupId

This will allow the following views, with slicing, dicing, and filtering on any dimension:

  • Histogram: count, rate, percentiles, avg, max
  • Gauge: count, rate, max, avg, sum
  • Counters: count, rate, total, avg, max
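To illustrate what a histogram view yields, here is a small self-contained sketch computing count, average, max, and a percentile over assumed send-duration samples (illustration only, not SDK code):

```java
import java.util.Arrays;

// Sketch: the aggregations a histogram view exposes (count, avg, max, percentiles),
// computed naively over a set of assumed send-duration samples in milliseconds.
public class HistogramViewSketch {
    // Nearest-rank percentile over an already-sorted array.
    static double percentile(double[] sorted, double p) {
        int idx = (int) Math.ceil(p / 100.0 * sorted.length) - 1;
        return sorted[Math.max(0, idx)];
    }

    public static void main(String[] args) {
        double[] sendDurationsMs = {12, 15, 9, 30, 22, 18, 11, 250}; // assumed samples
        double[] sorted = sendDurationsMs.clone();
        Arrays.sort(sorted);

        long count = sendDurationsMs.length;
        double avg = Arrays.stream(sendDurationsMs).average().orElse(0);
        double max = Arrays.stream(sendDurationsMs).max().orElse(0);

        System.out.printf("count=%d avg=%.1f max=%.0f p95=%.0f%n",
                count, avg, max, percentile(sorted, 95));
    }
}
```

Real metric SDKs compute these from bucketed histograms rather than raw samples, but the resulting views are the same.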

...

[WIP] Spec: https://gist.github.com/lmolkova/489a2b280b8fa68e4c3780c2afaa3b39

OpenTelemetry

  • status (5/9/2022): API and SDK stable as of 1.14
  • OpenTelemetry micrometer plugin: alpha
  • Application Insights agent: supports otel metrics in 3.3.0-beta release
  • Azure Monitor exporter: does not support metrics - TBD - roadmap
  • Other exporters: OTLP - stable, Prometheus - alpha
  • OTel exporter registry - here are the backends that support metrics (diff with Micrometer in bold): AWS CloudWatch, Datadog, Dynatrace, Elastic, Graphite, Influx, Instana, JMX, NewRelic, Stackdriver, Sumologic, Logzio, Honeycomb, Prometheus, SignalFx, StatsD (as a source), Wavefront
  • OTel instrumentations registry - an enormous list covering both traces and metrics derived from traces.
  • Semantics: OTel attempts to standardize metrics, dimensions, and attribute names across languages for generic scenarios (e.g. messaging)

Micrometer

  • Status: stable
  • Application Insights agent: supports micrometer (stable)
  • OpenTelemetry Java agent: supports micrometer (stable)
  • OpenTelemetry micrometer plugin: alpha
  • Micrometer backend registry (diff with OTel in bold) - AppOptics, Atlas, AWS CloudWatch, Datadog, Dynatrace, Elastic, Ganglia, Graphite, Humio, Influx, Instana, JMX, KairosDB, NewRelic, Prometheus, SignalFx, Stackdriver, StatsD, Wavefront.
  • Micrometer instrumentations - Spring Boot, JVM, Cache, OkHttp, Jetty and Jersey.
  • Micrometer does not have guidance or standards on attributes for generic scenarios

OpenTelemetry has a lot of instrumentations available; supporting it would minimize the future list of dependencies for users. Micrometer is the more stable solution, though.

OTel and Micrometer provide similar sets of meters (sync and callback-based): counters, gauges, and histograms.

  • OTel supports exemplars, which make it possible to see example traces corresponding to a specific measurement
  • OTel allows efficient and convenient use of dynamic attribute values


Plan

  • Custom attributes (OTel baggage, Micrometer tags)
    • OTel baggage is not supported YET
    • Micrometer: registry.config().commonTags("custom-tag", "foo");
  • Core changes and API review: done and released
  • AMQP core changes: in progress
  • ServiceBus changes
  • EventHubs changes
  • release the OTel plugin
  • add Micrometer-based implementation to samples
  • document our metrics conventions
  • AzMon review
  • Blog post
  • Update EH/SB TSGs