When customers experience issues with Azure SDKs, we frequently ask them to reproduce the problem and provide logs, potentially increasing the log level for the SDK. We then parse these logs to understand what happened and narrow down the issue. This process is cumbersome:
- logs are huge
- logs don't have a strict format:
  - the layout depends on customer configuration
  - there is no standard Azure SDK message format
- we rely on manual, ad-hoc grep+regex to parse logs
Write Azure SDK logs in a way that lets us automate parsing, querying, and potentially analyzing them. The solution should work with:
- log storage/APM tools: most customers send their logs somewhere (Splunk, Elastic, Azure Monitor/Log Analytics, etc.); the solution needs to work with such tools.
- traditional log files: when customers are able to write and share them, we should be able to parse them too
We want this solution for:
- ourselves when we parse logs
- CSS and customers: we should be able to write troubleshooting guides involving simple queries and analysis for common issues
There are more problems, such as increasing the log level and reproducing the issue; these are out of scope of this particular proposal. Even though they're extremely important, they have a much bigger scope and belong to the long-term supportability effort.
This proposal is a solution for cases where log collection is inevitable, aiming to reduce the time needed to investigate issues (mostly in AMQP SDKs).
- Introduce the ability to record context along with the log message through a new `ClientLogger` API
- Write the context in JSON, so it can be parsed easily with any scripts and also with log storage/query tools
Introduce new fluent APIs to add key-value pairs, similar to the SLF4J v2 fluent API:
```java
logger.atInfo()
    .addKeyValue("key1", val1)
    .addKeyValue("key2", val2)
    .log("hello world");
```
Alternatively, we could pass the context as a map or varargs without the fluent pattern, under the assumption that varargs have better perf/lower memory pressure:
```java
logger.info("hello world", "key1", val1, "key2", val2);
```
The fluent pattern is more friendly and leaves a smaller chance of mistakes mixing up key-value pairs, and there is no perf gain from varargs. (TODO: memory pressure?)
Prototype: draft PR
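For illustration, here's a minimal sketch of how such a fluent builder could serialize context into the message. The `addKeyValue`/`log` names come from the proposal above; the class name and the naive string-based serialization are assumptions (a real implementation, like the prototype, would need proper JSON escaping):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of a fluent context builder. The real ClientLogger wraps
// SLF4J; here System.out stands in for the underlying logger call.
public final class LoggingEventBuilderSketch {
    private final Map<String, Object> context = new LinkedHashMap<>();

    public LoggingEventBuilderSketch addKeyValue(String key, Object value) {
        context.put(key, value);
        return this;
    }

    public void log(String message) {
        // Serialize message + context as a single JSON object. A real
        // implementation must escape strings (e.g. via a JSON writer).
        StringBuilder sb = new StringBuilder("{\"az.sdk.message\":\"")
            .append(message).append('"');
        for (Map.Entry<String, Object> kvp : context.entrySet()) {
            sb.append(",\"").append(kvp.getKey()).append("\":\"")
              .append(kvp.getValue()).append('"');
        }
        sb.append('}');
        System.out.println(sb); // stand-in for logger.info(sb.toString())
    }

    public static void main(String[] args) {
        new LoggingEventBuilderSketch()
            .addKeyValue("connectionId", "MF_c991bc_1632981438863")
            .addKeyValue("linkName", "session-_368299_1633069903838")
            .log("Receiver link was never active. Closing endpoint states.");
    }
}
```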
Let's keep the context within the Azure SDK log message, e.g.:
```
2021-10-01 06:32:42,732 INFO [com.azure.core.amqp.implementation.handler.ReceiveLinkHandler] (reactor-executor-183) {"az.sdk.message": "Receiver link was never active. Closing endpoint states. ", "connectionId":"MF_c991bc_1632981438863", "linkName":"session-_368299_1633069903838", "entityPath":"pms-restrictions"}
```
Pros:
- [high-priority] it keeps the message and context format fully in our control.
- [high-priority] easy to parse with automated tooling on top of log files (see the parsing sketch below)
- [medium-priority] it's possible to parse Azure SDK messages at query time in Kusto (JSON or regex) or Splunk and populate columns from them when investigating issues, e.g. we can use such a query ourselves and share it with customers
- [low-priority] it keeps existing logs that don't use context untouched
Cons:
- [?-pri] Adds a ~5-10% perf hit on top of a log call with a formatted string. The absolute value is low (2000 ns), and presumably we don't write many warning/error logs, and users don't turn info on by default.
- [low-pri] context is not populated by default in a separate column when logs are exported to Azure Monitor, Splunk, or other log storage tools.
The assumption is that SDK-specific context (e.g. `connectionId`, `linkName`, etc.) is not generally useful for customers: users don't need it in a separate column unless they're investigating an SDK/Azure service issue, which should be rare and can probably be done at query time in a few popular tools. Kusto (Azure Monitor) and Splunk support message parsing; Elasticsearch does not, but we can still ask customers to export logs from their tool of choice and process them ourselves if we absolutely have to.
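To make the "easy to parse with automated tooling" point concrete, here's a sketch of a parser over plain log files. It assumes `jackson-databind` is on the classpath and that the JSON object terminates the line, as in the example record above; the filtering at the end is illustrative:

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;

// Sketch: extract structured Azure SDK context from an arbitrary log file.
// The layout prefix (date, level, logger, thread) can be anything, because
// we only look for the JSON payload within each line.
public final class LogParserSketch {
    private static final ObjectMapper MAPPER = new ObjectMapper();
    private static final String MARKER = "{\"az.sdk.message\"";

    public static void main(String[] args) throws Exception {
        for (String line : Files.readAllLines(Paths.get(args[0]))) {
            int start = line.indexOf(MARKER);
            if (start < 0) {
                continue; // not an Azure SDK structured message
            }
            // Assumes the JSON object ends the line (true for typical layouts
            // that put the message last).
            Map<?, ?> context = MAPPER.readValue(line.substring(start), Map.class);
            // Filter/aggregate as needed, e.g. group by connectionId.
            System.out.println(context.get("connectionId") + " -> " + context.get("az.sdk.message"));
        }
    }
}
```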
MDC vs explicit context in the message: MDC has ~50% overhead.
```
Benchmark                                           Mode  Cnt     Score     Error  Units
LoggingBenchmark.loggingAtDisabledLevel             avgt   15     1.268 ±   0.153  ns/op
LoggingBenchmark.loggingAtDisabledLevelWithContext  avgt   15     1.524 ±   0.092  ns/op
LoggingBenchmark.loggingAtDisabledLevelWithMDC      avgt   15     1.588 ±   0.252  ns/op
LoggingBenchmark.loggingAtEnabledLevel              avgt   15  2501.731 ± 549.315  ns/op
LoggingBenchmark.loggingAtEnabledLevelWithContext   avgt   15  2711.800 ± 259.968  ns/op
LoggingBenchmark.loggingAtEnabledLevelWithMDC       avgt   15  3624.982 ± 193.837  ns/op
```
The fluent pattern (`loggingAtEnabledLevelWithContext`) and varargs (`loggingAtEnabledLevelWithContext2`) don't show a significant difference in perf:
```
Benchmark                                           Mode  Cnt     Score     Error  Units
LoggingBenchmark.loggingAtEnabledLevel              avgt   15  1847.670 ± 128.052  ns/op
LoggingBenchmark.loggingAtEnabledLevelWithContext   avgt   15  2115.740 ± 134.049  ns/op
LoggingBenchmark.loggingAtEnabledLevelWithContext2  avgt   15  2106.112 ±  79.263  ns/op
```
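For reference, the benchmarks above roughly have this JMH shape. This is a sketch: the method bodies are illustrative and the actual benchmark code lives in the draft PR; whether INFO is enabled or disabled is decided by the logger configuration, not by the benchmark itself:

```java
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class LoggingBenchmarkSketch {
    private static final Logger LOGGER = LoggerFactory.getLogger(LoggingBenchmarkSketch.class);

    @Benchmark
    public void loggingAtEnabledLevel() {
        // Plain formatted log call: the baseline all variants are compared to.
        LOGGER.info("hello {}", "world");
    }

    @Benchmark
    public void loggingAtEnabledLevelWithMDC() {
        // Context via MDC: put/clear around every log call.
        MDC.put("connectionId", "foo");
        LOGGER.info("hello {}", "world");
        MDC.clear();
    }
}
```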
Example of a log file:

```
2021-10-01 06:32:42,732 INFO [com.azure.core.amqp.implementation.handler.ReceiveLinkHandler] (reactor-executor-183) connectionId[MF_c991bc_1632981438863] linkName[session-_368299_1633069903838] entityPath[pms-restrictions] Receiver link was never active. Closing endpoint states.
```
Log files have a layout configured by the user application. In the above example, it's `<date> <time> <level> [<full-class-name>] (<thread-name>) <message>`. The Azure SDK only controls the format of the `message`, and it's unreasonable to ask customers to change their log formats.
Summary: Structured logging with SLF4J v2 or MDC involves custom context configuration (location and layout), which makes the format unpredictable for us; we can't even rely on the context's presence.
It's not reasonable or possible to ask customers to write JSON logs. Layouts that work with JSON rely on `jackson-databind`. Existing solutions that are not specific to a logging vendor (e.g. Splunk) are out-of-date and not well-maintained (e.g. logback JSON).
SLF4J v2 adds support for structured logging. Here's an example:
```java
logger.atInfo()
    .addKeyValue("db.name", dbName)
    .addKeyValue("db.operation", operation)
    .addKeyValue("db.demo.collection", collectionName)
    .addArgument(() -> String.format("%s %s.%s", operation, dbName, collectionName))
    .log("'{}' call ended");
```
Layout config: note that it uses `%kvp`; context is not populated by default.
```xml
<appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
  <layout class="ch.qos.logback.classic.PatternLayout">
    <Pattern>
      %d{yyyy-MM-dd HH:mm:ss} - %logger{36} - %msg %kvp %n
    </Pattern>
  </layout>
</appender>
```
leads to the following log record:

```
[main] INFO org.example.mydb.DbClient - db.name=store db.operation=delete db.demo.collection=icecreamorders 'delete store.icecreamorders' call started
```
Summary: Not useful even in v2, where we could use SLF4J v2 (after it gets stable): we don't control how/if the context is populated (it needs separate configuration in the layout).
- Relies on `ThreadLocal` and a bunch of hacks/customizations to propagate it in async calls.
- Explicitly configured by the app:
```java
MDC.put("message", message);
LOGGER.info("Going to send message {}.", message);
MDC.clear();
```
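A small sketch of the `ThreadLocal` propagation problem mentioned above. It assumes a pooled executor thread; with Logback's MDC adapter, the value set on the calling thread is typically not visible there:

```java
import java.util.concurrent.CompletableFuture;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

// MDC values live in a ThreadLocal: they don't follow the work when it hops
// to another thread unless the app adds explicit propagation plumbing.
public class MdcAsyncSketch {
    private static final Logger LOGGER = LoggerFactory.getLogger(MdcAsyncSketch.class);

    public static void main(String[] args) {
        MDC.put("connectionId", "foo");
        LOGGER.info("visible here: {}", MDC.get("connectionId")); // "foo"
        CompletableFuture
            .runAsync(() -> LOGGER.info("on a pool thread: {}", MDC.get("connectionId"))) // typically null
            .join();
    }
}
```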
Layout config (`%X`):

```
%d{yyyy-MM-dd HH:mm:ss} - %logger{36} - %msg %X %n
%d{yyyy-MM-dd HH:mm:ss} - %logger{36} - %msg message=%X{message} trace_id=%X{trace_id} span_id=%X{span_id}%n
```
Summary: Not an option:
- MDC shows ~1.5x latency on enabled logs with context
- we don't control how/if it's populated (needs separate configuration in the layout)
Summary: JSON is way easier at query time.

JSON:

```json
{"az.sdk.message":"hello world", "connectionId":"foo","linkName":"bar"}
```
```kusto
traces
| where message startswith "{\"az.sdk.message"
| project timestamp, logger=customDimensions["LoggerName"], level=customDimensions["LoggingLevel"], thread=customDimensions["ThreadName"], azSdkContext=parse_json(message)
| evaluate bag_unpack(azSdkContext)
```
Regex example: `hello world, az.sdk.context: connectionId[foo], linkName[bar]`
```kusto
traces
| where message contains "az.sdk.context:"
| project azSdkContext=substring(message, indexof(message, "az.sdk.context:") + 15)
| project parsed = extract_all(@"(\w+)\[([^\]]+)\]", azSdkContext)
| mv-apply kvp = parsed to typeof(dynamic) on (
    summarize dict = make_bag(pack(tostring(kvp[0]), tostring(kvp[1])))
)
| project dict
| evaluate bag_unpack(dict)
```
SDKs should avoid multi-line messages. We also remove new lines from messages in `ClientLogger`.
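For illustration, a minimal sketch of the kind of newline removal `ClientLogger` could apply; the exact policy (strip vs. escape) is an implementation detail of the prototype:

```java
// Sanitize a message so it always occupies a single log line.
final class LogSanitizerSketch {
    static String removeNewLines(String message) {
        if (message == null
            || (message.indexOf('\n') < 0 && message.indexOf('\r') < 0)) {
            return message; // common case: nothing to replace, avoid allocation
        }
        return message.replace("\r\n", " ").replace('\n', ' ').replace('\r', ' ');
    }
}
```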
Exceptions: Application Insights collects exceptions as objects (if the SDK writes them when debugging is enabled, or if an exception is unhandled and bubbles all the way up).
Logging pipelines and tools support concatenation: fluentd and Splunk support merging multi-line messages into one (based on the log format) for syslog, stdout from containers, etc.
Azure/azure-sdk-for-java#25247