When customers experience issues with Azure SDKs, we frequently ask them to reproduce the problem and provide logs, potentially increasing the log level for the SDK. We then parse these logs to understand what happened and narrow down the issue. This process is cumbersome:
- logs are huge
- logs don't have a strict format:
  - the layout depends on customer configuration
  - there is no standard Azure SDK message format
- we rely on manual, ad-hoc grep+regex to parse logs
Write Azure SDK logs in a way that lets us automate parsing, querying, and potentially analyzing them. The solution should work with:
- log storage/APM tools: most customers send their logs somewhere (Splunk, Elastic, Azure Monitor/Log Analytics, etc.); the solution needs to work with such tools.
- traditional log files: when customers are able to write and share them, we should be able to parse them too
We want this solution for:
- ourselves when we parse logs
- CSS and customers: we should be able to write troubleshooting guides involving simple queries and analysis for common issues
There are more problems, such as increasing the log level and reproducing the issue; these are out of scope of this particular proposal. Even though they're extremely important, they have a much bigger scope and belong to the long-term supportability effort.
This proposal is a solution for cases where log collection is inevitable, aiming to reduce the time needed to investigate issues (mostly in AMQP SDKs).
- Introduce the ability to record context along with the log message through a new `ClientLogger` API
- Write the context in JSON, so it can be parsed easily with any scripts and also with log storage/query tools
Introduce new fluent APIs to add key-value pairs, similar to the SLF4J v2 fluent API:
```java
logger.atInfo()
    .addKeyValue("key1", val1)
    .addKeyValue("key2", val2)
    .log("hello world");
```
Alternatively, we could pass the context as a map or varargs without the fluent pattern, under the assumption that varargs have better perf/lower memory pressure:
```java
logger.info("hello world", "key1", val1, "key2", val2);
```
The fluent pattern is more friendly and leaves a smaller chance of mistakes mixing up key-value pairs, and there is no perf gain from varargs. (TODO: memory pressure?)
Prototype: draft PR
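For illustration, here's a minimal sketch of how such a fluent builder could serialize context into the message. The `addKeyValue`/`log` names come from the proposal above; the class name and the naive string-based serialization are assumptions (a real implementation, like the prototype, would need proper JSON escaping):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of a fluent context builder. The real ClientLogger wraps
// SLF4J; here System.out stands in for the underlying logger call.
public final class LoggingEventBuilderSketch {
    private final Map<String, Object> context = new LinkedHashMap<>();

    public LoggingEventBuilderSketch addKeyValue(String key, Object value) {
        context.put(key, value);
        return this;
    }

    public void log(String message) {
        // Serialize message + context as a single JSON object. A real
        // implementation must escape strings (e.g. via a JSON writer).
        StringBuilder sb = new StringBuilder("{\"az.sdk.message\":\"")
            .append(message).append('"');
        for (Map.Entry<String, Object> kvp : context.entrySet()) {
            sb.append(",\"").append(kvp.getKey()).append("\":\"")
              .append(kvp.getValue()).append('"');
        }
        sb.append('}');
        System.out.println(sb); // stand-in for logger.info(sb.toString())
    }

    public static void main(String[] args) {
        new LoggingEventBuilderSketch()
            .addKeyValue("connectionId", "MF_c991bc_1632981438863")
            .addKeyValue("linkName", "session-_368299_1633069903838")
            .log("Receiver link was never active. Closing endpoint states.");
    }
}
```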
Let's keep the context within the Azure SDK log message, e.g.:
```
2021-10-01 06:32:42,732 INFO [com.azure.core.amqp.implementation.handler.ReceiveLinkHandler] (reactor-executor-183) {"az.sdk.message": "Receiver link was never active. Closing endpoint states. ", "connectionId":"MF_c991bc_1632981438863", "linkName":"session-_368299_1633069903838", "entityPath":"pms-restrictions"}
```
Pros:
- [high-priority] it keeps the message and context format fully in our control.
- [high-priority] easy to parse with automated tooling on top of log files (see the parsing sketch below)
- [medium-priority] it's possible to parse Azure SDK messages at query time in Kusto (JSON or regex) or Splunk and populate columns from them when investigating issues, e.g. we can use such a query ourselves and share it with customers
- [low-priority] it keeps existing logs that don't use context untouched
Cons:
- [?-pri] Adds a ~5-10% perf hit on top of a log call with a formatted string. The absolute value is low (2000 ns), and presumably we don't write many warning/error logs, and users don't turn info on by default.
- [low-pri] context is not populated by default in a separate column when logs are exported to Azure Monitor, Splunk, or other log storage tools.
The assumption is that SDK-specific context (e.g. `connectionId`, `linkName`, etc.) is not generally useful for customers: users don't need it in a separate column unless they're investigating an SDK/Azure service issue, which should be rare and can probably be done at query time in a few popular tools. Kusto (Azure Monitor) and Splunk support message parsing; Elasticsearch does not, but we can still ask customers to export logs from their tool of choice and process them ourselves if we absolutely have to.
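To make the "easy to parse with automated tooling" point concrete, here's a sketch of a parser over plain log files. It assumes `jackson-databind` is on the classpath and that the JSON object terminates the line, as in the example record above; the filtering at the end is illustrative:

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;

// Sketch: extract structured Azure SDK context from an arbitrary log file.
// The layout prefix (date, level, logger, thread) can be anything, because
// we only look for the JSON payload within each line.
public final class LogParserSketch {
    private static final ObjectMapper MAPPER = new ObjectMapper();
    private static final String MARKER = "{\"az.sdk.message\"";

    public static void main(String[] args) throws Exception {
        for (String line : Files.readAllLines(Paths.get(args[0]))) {
            int start = line.indexOf(MARKER);
            if (start < 0) {
                continue; // not an Azure SDK structured message
            }
            // Assumes the JSON object ends the line (true for typical layouts
            // that put the message last).
            Map<?, ?> context = MAPPER.readValue(line.substring(start), Map.class);
            // Filter/aggregate as needed, e.g. group by connectionId.
            System.out.println(context.get("connectionId") + " -> " + context.get("az.sdk.message"));
        }
    }
}
```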
MDC vs explicit context in the message: MDC has ~50% overhead.
```
Benchmark                                           Mode  Cnt     Score     Error  Units
LoggingBenchmark.loggingAtDisabledLevel             avgt   15     1.268 ±   0.153  ns/op
LoggingBenchmark.loggingAtDisabledLevelWithContext  avgt   15     1.524 ±   0.092  ns/op
LoggingBenchmark.loggingAtDisabledLevelWithMDC      avgt   15     1.588 ±   0.252  ns/op
LoggingBenchmark.loggingAtEnabledLevel              avgt   15  2501.731 ± 549.315  ns/op
LoggingBenchmark.loggingAtEnabledLevelWithContext   avgt   15  2711.800 ± 259.968  ns/op
LoggingBenchmark.loggingAtEnabledLevelWithMDC       avgt   15  3624.982 ± 193.837  ns/op
```
The fluent pattern (`loggingAtEnabledLevelWithContext`) and varargs (`loggingAtEnabledLevelWithContext2`) don't show a significant difference in perf:
```
Benchmark                                           Mode  Cnt     Score     Error  Units
LoggingBenchmark.loggingAtEnabledLevel              avgt   15  1847.670 ± 128.052  ns/op
LoggingBenchmark.loggingAtEnabledLevelWithContext   avgt   15  2115.740 ± 134.049  ns/op
LoggingBenchmark.loggingAtEnabledLevelWithContext2  avgt   15  2106.112 ±  79.263  ns/op
```
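For reference, the benchmarks above roughly have this JMH shape. This is a sketch: the method bodies are illustrative and the actual benchmark code lives in the draft PR; whether INFO is enabled or disabled is decided by the logger configuration, not by the benchmark itself:

```java
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class LoggingBenchmarkSketch {
    private static final Logger LOGGER = LoggerFactory.getLogger(LoggingBenchmarkSketch.class);

    @Benchmark
    public void loggingAtEnabledLevel() {
        // Plain formatted log call: the baseline all variants are compared to.
        LOGGER.info("hello {}", "world");
    }

    @Benchmark
    public void loggingAtEnabledLevelWithMDC() {
        // Context via MDC: put/clear around every log call.
        MDC.put("connectionId", "foo");
        LOGGER.info("hello {}", "world");
        MDC.clear();
    }
}
```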
Example of a log file:

```
2021-10-01 06:32:42,732 INFO [com.azure.core.amqp.implementation.handler.ReceiveLinkHandler] (reactor-executor-183) connectionId[MF_c991bc_1632981438863] linkName[session-_368299_1633069903838] entityPath[pms-restrictions] Receiver link was never active. Closing endpoint states.
```
Log files have a layout configured by the user application. In the above example, it's `<date> <time> <level> [<full-class-name>] (<thread-name>) <message>`. The Azure SDK only controls the format of the `message`, and it's unreasonable to ask customers to change their log formats.
Summary: Structured logging with SLF4J v2 or MDC involves custom context configuration (location and layout), which makes the format unpredictable for us; we can't even rely on the context's presence.
It's not reasonable or possible to ask customers to write JSON logs. Layouts that work with JSON rely on `jackson-databind`. Existing solutions that are not specific to a logging vendor (e.g. Splunk) are out-of-date and not well-maintained (e.g. logback JSON).
SLF4J v2 adds support for structured logging. Here's an example:
```java
logger.atInfo()
    .addKeyValue("db.name", dbName)
    .addKeyValue("db.operation", operation)
    .addKeyValue("db.demo.collection", collectionName)
    .addArgument(() -> String.format("%s %s.%s", operation, dbName, collectionName))
    .log("'{}' call ended");
```
Layout config: note that it uses `%kvp`; context is not populated by default.
```xml
<appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
  <layout class="ch.qos.logback.classic.PatternLayout">
    <Pattern>
      %d{yyyy-MM-dd HH:mm:ss} - %logger{36} - %msg %kvp %n
    </Pattern>
  </layout>
</appender>
```
leads to the following log record:

```
[main] INFO org.example.mydb.DbClient - db.name=store db.operation=delete db.demo.collection=icecreamorders 'delete store.icecreamorders' call started
```
Summary: Not useful even in v2, where we could use SLF4J v2 (after it gets stable): we don't control how/if the context is populated (it needs separate configuration in the layout).
- Relies on `ThreadLocal` and a bunch of hacks/customizations to propagate it in async calls.
- Explicitly configured by the app:
```java
MDC.put("message", message);
LOGGER.info("Going to send message {}.", message);
MDC.clear();
```
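A small sketch of the `ThreadLocal` propagation problem mentioned above. It assumes a pooled executor thread; with Logback's MDC adapter, the value set on the calling thread is typically not visible there:

```java
import java.util.concurrent.CompletableFuture;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

// MDC values live in a ThreadLocal: they don't follow the work when it hops
// to another thread unless the app adds explicit propagation plumbing.
public class MdcAsyncSketch {
    private static final Logger LOGGER = LoggerFactory.getLogger(MdcAsyncSketch.class);

    public static void main(String[] args) {
        MDC.put("connectionId", "foo");
        LOGGER.info("visible here: {}", MDC.get("connectionId")); // "foo"
        CompletableFuture
            .runAsync(() -> LOGGER.info("on a pool thread: {}", MDC.get("connectionId"))) // typically null
            .join();
    }
}
```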
Layout config (`%X`):

```
%d{yyyy-MM-dd HH:mm:ss} - %logger{36} - %msg %X %n
%d{yyyy-MM-dd HH:mm:ss} - %logger{36} - %msg message=%X{message} trace_id=%X{trace_id} span_id=%X{span_id}%n
```
Summary: Not an option:
- MDC shows ~1.5x latency on enabled logs with context
- we don't control how/if it's populated (needs separate configuration in the layout)
Summary: JSON is way easier at query time.

JSON:

```json
{"az.sdk.message":"hello world", "connectionId":"foo","linkName":"bar"}
```
```kusto
traces
| where message startswith "{\"az.sdk.message"
| project timestamp, logger=customDimensions["LoggerName"], level=customDimensions["LoggingLevel"], thread=customDimensions["ThreadName"], azSdkContext=parse_json(message)
| evaluate bag_unpack(azSdkContext)
```
Regex example: `hello world, az.sdk.context: connectionId[foo], linkName[bar]`
```kusto
traces
| where message contains "az.sdk.context:"
| project azSdkContext=substring(message, indexof(message, "az.sdk.context:") + 15)
| project parsed = extract_all(@"(\w+)\[([^\]]+)\]", azSdkContext)
| mv-apply kvp = parsed to typeof(dynamic) on (
    summarize dict = make_bag(pack(tostring(kvp[0]), tostring(kvp[1])))
)
| project dict
| evaluate bag_unpack(dict)
```
SDKs should avoid multi-line messages. We also remove new lines from messages in `ClientLogger`.
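For illustration, a minimal sketch of the kind of newline removal `ClientLogger` could apply; the exact policy (strip vs. escape) is an implementation detail of the prototype:

```java
// Sanitize a message so it always occupies a single log line.
final class LogSanitizerSketch {
    static String removeNewLines(String message) {
        if (message == null
            || (message.indexOf('\n') < 0 && message.indexOf('\r') < 0)) {
            return message; // common case: nothing to replace, avoid allocation
        }
        return message.replace("\r\n", " ").replace('\n', ' ').replace('\r', ' ');
    }
}
```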
Exceptions: Application Insights collects exceptions as objects (if the SDK writes them when debugging is enabled, or if an exception is unhandled and bubbles all the way up).
Logging pipelines and tools support concatenation: fluentd and Splunk support merging multi-line messages into one (based on the log format) for syslog, stdout from containers, etc.
Azure/azure-sdk-for-java#25247