Metrics in Azure SDK for Java

The immediate goal is to report metrics from Azure messaging SDKs (EventHubs and ServiceBus) to help customers detect and investigate configuration issues, performance bottlenecks, application and SDK bugs.

It can be broken down into smaller goals:

define metrics essential for messaging scenarios
define the Metrics API in azure-core
implement metrics plugins

Scenarios

User scenarios

We expect users to want to know how many messages were received, processed and checkpointed; the delay of messages that consumers receive; batch size; success rate of network operations; and other key metrics we're going to define. Some of these metrics can be calculated from traces, but not all of them, and we're going to focus on the latter. Metrics would provide a more performant, cheaper and more production-ready solution than tracing.

We expect users to already have some metrics solution in their app. Based on a Spring One survey, 90%+ of attendees use an APM tool (for logs, metrics, or traces); of those, ~20%+ use Prometheus and ~30% use Azure Monitor.

SDK scenarios

Supportability: our TSGs should include steps that ask users to check metrics emitted by the SDK instead of verbose logs. This would help narrow down problems without reconfiguring logging and reproducing the issue.
Stress tests: assuming SDKs report metrics, stress tests would be just a regular user of this feature. If we see an issue in a stress test run, we can use built-in metrics to investigate it the same way users would.

Usage beyond messaging SDKs

HTTP-based SDKs: limited, and can be done in core, automagically in tracing calls before sampling.
Thick clients: CosmosDB (already uses Micrometer in Java, has a similar ask for .NET)

Beyond Java

.NET: OTel Metrics are included in DiagnosticSource 6.0.
Python: OTel metrics API is in RC
JS: OTel metrics are in development
Go: Alpha
C++: Alpha

The proposal here is to polish the scenario in Java, where we have a partner ask, and learn from it before doing any work in other languages.

Metric solution

We're going to pick the solution that

works for Spring Cloud
works with Azure Monitor and Prometheus
is compatible with a variety of other APM vendors
has a fair amount of existing instrumentations

Summary

We'll have Meter API abstractions in azure-core
We'll provide an OTel-based implementation of the Meter APIs
Spring will keep using Micrometer and will provide a Micrometer-based implementation similar to this sample

OTel vs Micrometer analysis

Metrics API

Closely follow the OTel metrics API (Micrometer APIs are quite similar). Implement only a subset: perf is more important than convenience.

Naming choice: Azure prefix is added to avoid collision with OTel Meter.
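
As a rough illustration, the subset could look like the sketch below. It is inferred only from the usage example in this document (createLongCounter taking a name, description, unit and attributes; add taking a value and a context). AzureMeter, InMemoryMeter and MeterSketch are hypothetical names, and Object stands in for azure-core's Context so that the snippet compiles on its own. It is not the actual azure-core API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical counter: attributes are bound at creation, so add() stays cheap on the hot path. */
interface AzureLongCounter {
    // 'context' stands in for azure-core's Context (an assumption to keep the sketch standalone).
    void add(long value, Object context);
}

/** Hypothetical meter: a factory for instruments, loosely mirroring OTel's Meter. */
interface AzureMeter {
    AzureLongCounter createLongCounter(String name, String description, String unit,
        Map<String, Object> attributes);
}

/** In-memory implementation for illustration; a real plugin would delegate to OTel or Micrometer. */
final class InMemoryMeter implements AzureMeter {
    private final Map<String, LongAdder> totals = new ConcurrentHashMap<>();

    @Override
    public AzureLongCounter createLongCounter(String name, String description, String unit,
        Map<String, Object> attributes) {
        // One time series per (name, attributes) pair; description and unit are ignored here.
        LongAdder total = totals.computeIfAbsent(key(name, attributes), k -> new LongAdder());
        return (value, context) -> total.add(value);
    }

    /** Returns the accumulated value for a (name, attributes) pair, or 0 if nothing was recorded. */
    long total(String name, Map<String, Object> attributes) {
        LongAdder total = totals.get(key(name, attributes));
        return total == null ? 0 : total.sum();
    }

    private static String key(String name, Map<String, Object> attributes) {
        // Sort attributes so that equal maps always produce the same key.
        Map<String, Object> sorted = attributes == null ? new TreeMap<>() : new TreeMap<>(attributes);
        return name + sorted;
    }
}

final class MeterSketch {
    public static void main(String[] args) {
        InMemoryMeter meter = new InMemoryMeter();
        Map<String, Object> success = new HashMap<>();
        success.put("error", false);

        AzureLongCounter connections = meter.createLongCounter("az.core.http.connections",
            "Number of created HTTP connections", null, success);
        connections.add(1, null);
        connections.add(1, null);

        System.out.println(meter.total("az.core.http.connections", success));
    }
}
```

Binding attributes when the instrument is created keeps the per-measurement call down to a single add, which matches the stated preference for perf over convenience.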

Example of usage in Client libraries

// Attributes with possible error status can be created upfront, usually along with the client instance.
Map<String, Object> successAttributes = createAttributes("http://service-endpoint.azure.com", false);
Map<String, Object> errorAttributes = createAttributes("http://service-endpoint.azure.com", true);

// Create instruments for possible error codes. Can be done lazily once specific error code is received.
AzureLongCounter successfulHttpConnections = defaultMeter.createLongCounter("az.core.http.connections",
    "Number of created HTTP connections", null, successAttributes);

AzureLongCounter failedHttpConnections = defaultMeter.createLongCounter("az.core.http.connections",
    "Number of created HTTP connections", null, errorAttributes);

boolean success = false;
try {
    success = connect();
} finally {
    if (success) {
        successfulHttpConnections.add(1, currentContext);
    } else {
        failedHttpConnections.add(1, currentContext);
    }
}
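
The createAttributes helper used above is not defined in this document. A minimal sketch of what it might do, assuming hypothetical attribute names "endpoint" and "error" (neither name comes from the proposal):

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

final class AttributesSketch {
    /** Builds an immutable attribute set once, so it can be reused for every measurement. */
    static Map<String, Object> createAttributes(String endpoint, boolean isError) {
        Map<String, Object> attributes = new LinkedHashMap<>();
        attributes.put("endpoint", endpoint); // hypothetical attribute name
        attributes.put("error", isError);     // hypothetical attribute name
        return Collections.unmodifiableMap(attributes);
    }

    public static void main(String[] args) {
        System.out.println(createAttributes("https://service-endpoint.azure.com", false));
    }
}
```

Returning an unmodifiable map makes it safe to share the same attribute set between instruments and threads.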

User apps

Basic

// configure OpenTelemetry SDK as usual and register global configuration
SdkMeterProvider meterProvider = SdkMeterProvider.builder()
    .registerMetricReader(PeriodicMetricReader.builder(OtlpGrpcMetricExporter.builder().build()).build())
    .build();

OpenTelemetrySdk.builder()
    .setMeterProvider(meterProvider)
    .setPropagators(ContextPropagators.create(W3CTraceContextPropagator.getInstance()))
    .buildAndRegisterGlobal();

// configure Azure Client, no metric configuration needed, client will use global OTel configuration
AzureClient sampleClient = new AzureClientBuilder()
    .endpoint("https://my-client.azure.com")
    .build();

// use client as usual, if it emits metrics, they will be exported
sampleClient.methodCall("get items", Context.NONE);

Custom configuration along with tracing

// configure OpenTelemetry SDK as usual
SdkMeterProvider meterProvider = SdkMeterProvider.builder()
    .registerMetricReader(PeriodicMetricReader.builder(OtlpGrpcMetricExporter.builder().build()).build())
    .build();

OpenTelemetry openTelemetry = OpenTelemetrySdk.builder()
    .setMeterProvider(meterProvider)
    .setPropagators(ContextPropagators.create(W3CTraceContextPropagator.getInstance()))
    .build();

// Pass OTel meterProvider to MetricsOptions - it will be used instead of the implicit global singleton.
MetricsOptions customMetricsOptions = new MetricsOptions()
    .setProvider(meterProvider);

// configure Azure Client with the custom metrics options
// (assumes the builder accepts ClientOptions carrying MetricsOptions)
AzureClient sampleClient = new AzureClientBuilder()
    .endpoint("https://my-client.azure.com")
    .clientOptions(new ClientOptions().setMetricsOptions(customMetricsOptions))
    .build();

Span span = openTelemetry.getTracer("azure-core-samples")
    .spanBuilder("doWork")
    .startSpan();

try (Scope scope = span.makeCurrent()) {
    // do some work

    // Current context flows to OpenTelemetry metrics and is used to populate exemplars
    String response = sampleClient.methodCall("get items");
    // do more work
} finally {
    // end the span even if the work above throws
    span.end();
}

Messaging Metrics

Prior art

Current EventHubs broker metrics
- Requests: incoming/outgoing, success, throttles
- Messages: incoming, outgoing, captured
- Bytes: incoming, outgoing, size of EH
Track -1 ServiceBus performance counters
- SendMessage/ReceiveMessage/CompleteMessage/AcceptSession/CancelScheduled
  - count (error/success)
  - rate (error/success)
  - duration
  - per namespace and per entity
- Exceptions: count and rate (by type)
- TokenAcquisition: rate (success/error), latency
- Pending ReceiveMessage/AcceptMessageSession/AcceptMessageSessionByNamespace: count
- EventProcessor process: latency, batch size
- Connections: reset count (per entity), redirect count
- Prefetch queue size and depth(?) per entity
- Throughput (in/out): byte rate (per ns/entity)
XBox EventHubs Perf Counters - internal
- Blob offset store: time since last offset flush
- Producer (per topic):
  - Latency
  - Throughput
  - Request rate, retry rate, timeout rate, error receive rate
  - Transmission error rate
  - Event rate per partition
- Buffered producer
  - Queue size
  - Queue full rate
  - Enqueue rate
  - Batch size
  - Event Time in queue
- Consumer (per topic, per partition)
  - Lag (which is two other metrics)
    - last received (seqNo) - last published (seqNo)
    - Receive rate (success/error)
    - Producer-to-consumer latency (receive timestamp - enqueued-time) - approx?
    - Seconds to Zero: Lag / consumption rate
    - Consumption rate
    - Consumer queue size
    - Delivery queue: size, incoming rate, delivery rate, delivery failure rate
Current Kafka metrics
- Producer:
  - batch size, splits
  - throughput: outgoing bytes, compression rate
  - metadata age
  - throttle time
  - record: errors, rate, time in send buffer, retries, size
  - request: rate, size, active
  - response rate, bytes
- Connections: close, creation, io stats
- Consumer:
  - fetch: latency, rate, size, throttle time, counts
  - records: rate, lag, batch size
  - bytes consumed
  - consumer groups: partitions, commit latency, rate, join rate, etc
Kafka proposal
- intent: Kafka client library internals observability
- metrics:
  - connections creations/active, errors
  - requests: rate, rtt, errors
  - internal queue latency, size
  - client io wait time
  - producer queue size, bytes
  - consumer
    - poll interval, latency, last time
    - consumer queue count, bytes
    - consumer group: errors, rebalance, partitions counts
DataDog article on key RabbitMQ metrics
- broker side, mostly irrelevant
DataDog article on Kafka metrics
- producer: request/response rate, latency, io wait time, batch size, throughput produced and batch-compression rate
- consumer: records lag, records rate, fetch rate, throughput consumed
OTel proposal, early WIP
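
The consumer lag metrics in the prior art above are derived values. A minimal sketch of the arithmetic, assuming lag is measured in sequence numbers (last published minus last received) and that "Seconds to Zero" means lag divided by the consumption rate, since that is the reading that yields a time. The class and method names are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

final class ConsumerLagSketch {
    /** Number of events published but not yet received, in sequence numbers. */
    static long lag(long lastPublishedSeqNo, long lastReceivedSeqNo) {
        return Math.max(0, lastPublishedSeqNo - lastReceivedSeqNo);
    }

    /** Time needed to drain the current lag at the given consumption rate (events per second). */
    static double secondsToZero(long lag, double consumptionRatePerSecond) {
        if (lag == 0) {
            return 0;
        }
        if (consumptionRatePerSecond <= 0) {
            // Nothing is being consumed, so the lag never drains.
            return Double.POSITIVE_INFINITY;
        }
        return lag / consumptionRatePerSecond;
    }

    /** Approximate producer-to-consumer latency; approximate because the two clocks may differ. */
    static long producerToConsumerLatencyMillis(Instant enqueuedTime, Instant receivedTime) {
        return Duration.between(enqueuedTime, receivedTime).toMillis();
    }

    public static void main(String[] args) {
        long lag = lag(150, 100);
        System.out.println("lag=" + lag + ", secondsToZero=" + secondsToZero(lag, 25.0));
    }
}
```

This estimate ignores events published while the backlog drains, so it is a lower bound whenever producers are still active.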

EventHubs Metrics Proposal

Report metrics that are useful for customers when operating applications with EventHubs or ServiceBus. We can add more to expose internals later.

All

Metric	Type	Comment
Last offset on broker	counter	[TODO] Opt-in, offset of the last message published successfully
Last sequence number on broker	counter	[TODO] Opt-in, sequence number of the last message published successfully
AMQP link: errors	counter	link errors counter by error code
AMQP session: errors	counter	session errors counter by error code
AMQP Connections: active	up-down-counter	Number of active connections; available on broker, not per client process
AMQP Connections: creations	counter	Number of created connections; available on broker, not per client

Dimensions:

Namespace
Entity
EntityPath

Last offset and last sequence number on broker can only be reported as opt-in metrics (additional charges apply); customers would be expected to opt in on either the producer or the consumer.

Producer

Metric	Type	Comment
Send: duration	histogram	Duration (in milliseconds) of a ProducerClient.Send call, including all retries
Send: messages in batch	counter	Number of messages sent per Producer.Send call
Send: bytes in batch	histogram	Number of bytes sent per Producer.Send call
AMQP link: send duration	histogram	Response time (in milliseconds) of AMQP request

Dimensions:

Namespace
Entity
EntityPath
Error code (or success)

Notes:

Attempt metrics can be calculated from these, e.g. avg number of attempts = count(link_duration)/count(send_duration). If that proves insufficient, we can come up with a better one.
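
As a worked example of that ratio: if the link-level send histogram recorded 130 measurements while the Send histogram recorded 100, each Send call took 1.3 attempts on average. The sketch below only restates this arithmetic; the class and method names are hypothetical.

```java
final class AttemptsSketch {
    /**
     * Average attempts per Send call: the number of link-level send measurements
     * divided by the number of Send call measurements.
     */
    static double averageAttempts(long linkSendCount, long sendCallCount) {
        if (sendCallCount == 0) {
            // No Send calls were recorded, so the ratio is undefined.
            return Double.NaN;
        }
        return (double) linkSendCount / sendCallCount;
    }

    public static void main(String[] args) {
        System.out.println(averageAttempts(130, 100));
    }
}
```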

Consumer

Metric	Type	Comment
AMQP: messages received	counter	Number of messages received per Consumer.Receive call
AMQP: credits requested	counter	Number of credits requested from broker.
Processor: duration	histogram	available on broker, not per client
Processor: error handler	counter	Error Handler Invocations
Checkpoint: duration	histogram	available on broker, not per client
Checkpoint: last offset checkpointed	counter
Checkpoint: last sequence number checkpointed	counter

Dimensions:

Namespace
Entity
Error code (or success)
EntityPath
Consumer GroupId

This will allow the following views, with slicing, dicing and filtering by any dimension:

Histogram: count, rate, percentiles, avg, max
Gauge: count, rate, max, avg, sum
Counters: count, rate, total, avg, max
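
To make the histogram views concrete, here is a minimal sketch that derives count, rate, avg, max and a percentile from raw duration measurements. Real metrics backends compute these from aggregated bucket counts rather than raw values, so this is only an illustration of what each view means; the class name and the nearest-rank percentile method are choices made for this example.

```java
import java.util.Arrays;

final class HistogramViewSketch {
    static long count(long[] values) {
        return values.length;
    }

    /** Measurements per second over the given collection window. */
    static double ratePerSecond(long[] values, double windowSeconds) {
        return values.length / windowSeconds;
    }

    static double average(long[] values) {
        return Arrays.stream(values).average().orElse(Double.NaN);
    }

    static long max(long[] values) {
        return Arrays.stream(values).max().orElse(0);
    }

    /** Nearest-rank percentile: the smallest value with at least p percent of values at or below it. */
    static long percentile(long[] values, double p) {
        if (values.length == 0) {
            throw new IllegalArgumentException("no measurements");
        }
        long[] sorted = values.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        // Clamp so that p = 0 maps to the first element and p >= 100 to the last.
        int index = Math.min(Math.max(rank, 1), sorted.length) - 1;
        return sorted[index];
    }

    public static void main(String[] args) {
        long[] sendDurationsMs = {40, 10, 100, 20, 30};
        System.out.println("count=" + count(sendDurationsMs)
            + " avg=" + average(sendDurationsMs)
            + " max=" + max(sendDurationsMs)
            + " p50=" + percentile(sendDurationsMs, 50));
    }
}
```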

...

[WIP] Spec: https://gist.github.com/lmolkova/489a2b280b8fa68e4c3780c2afaa3b39

OpenTelemetry

Status (5/9/2022): API and SDK are stable as of 1.14
OpenTelemetry Micrometer plugin: alpha
Application Insights agent: supports OTel metrics in the 3.3.0-beta release
Azure Monitor exporter: does not support metrics - TBD - roadmap
Other exporters: OTLP - stable, Prometheus - alpha
OTel exporter registry: backends that support metrics are AWS CloudWatch, Datadog, Dynatrace, Elastic, Graphite, Influx, Instana, JMX, NewRelic, Stackdriver, Sumologic, Logzio, Honeycomb, Prometheus, SignalFx, StatsD (as a source), Wavefront. Of these, Sumologic, Logzio and Honeycomb are not in the Micrometer list.
OTel instrumentations registry: an enormous list of instrumentations for traces (and metrics derived from traces).
Semantics: OTel attempts to standardize metrics, dimensions and attribute names across languages for generic scenarios (e.g. messaging)

Micrometer

Status: stable
Application Insights agent: supports micrometer (stable)
OpenTelemetry Java agent: supports micrometer (stable)
OpenTelemetry micrometer plugin: alpha
Micrometer backend registry: AppOptics, Atlas, AWS CloudWatch, Datadog, Dynatrace, Elastic, Ganglia, Graphite, Humio, Influx, Instana, JMX, KairosDB, NewRelic, Prometheus, SignalFx, Stackdriver, StatsD, Wavefront. Of these, AppOptics, Atlas, Ganglia, Humio and KairosDB are not in the OTel list.
Micrometer instrumentations - Spring Boot, JVM, Cache, OkHttp, Jetty and Jersey.
Micrometer does not have guidance or standards on attributes for generic scenarios

OTel has a lot of instrumentations available; supporting it would minimize the future list of dependencies for users. Micrometer is the more stable solution, though.

OTel and Micrometer provide similar sets of Meters (sync and callback-based): counters, gauges, histograms.

OTel supports exemplars, which let users see example traces corresponding to a specific measurement
OTel allows dynamic attribute values to be used efficiently and conveniently

lmolkova/metrics_azure_sdk.md
