
Logging Standards and Best Practices

Structured Log Format / Semantic Logging

From the company logging and monitoring policy:

xiii. Information systems shall be able to automatically process audit records for events of interest based on selectable criteria.

To ensure this, we should use structured logs, which are preferable to free-form text lines that need to be parsed. In a complex distributed system, the benefits of structured logging are immense.

  • Logs, when viewed in isolation, should be meaningful.

  • Less is better than more.

  • Roll up many related logs from many related lines of code into one structured log entry.

Details

The common pattern of logging information as free-form strings of text stems from the misconception that logs are primarily for humans to read, and that the human reading them will know what to look for when digesting them. In a complex environment, neither tenet holds true.

The primary benefit of structured logging is that it makes interpreting the logs more dynamic: code can be written, and thus tools can be built, to help process and analyze them. It will be easier to find exactly what you're looking for in historical logs if you're using a structured log format, and it will be easier to build monitoring/alerting tooling on top of structured logs. Logs should be considered events, just like anything else in a distributed system; the more relevant context and metadata they carry, the better.

That said, repeatedly dumping a huge JSON blob or object.__repr__ containing mostly the same information isn't much better than repeatedly dumping lines of free text. Semantic/structured logging is great, but consistency is key, and information overload should be avoided. One way to achieve this is with log contexts: many individual log entries can be aggregated together and then flushed as one structured log entry, which is easier to understand than many redundant entries where only a few values change.
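
For illustration, here's a minimal sketch of the log-context idea; the LogContext class and all field names are hypothetical, not an existing library API:

import json
import logging

logger = logging.getLogger(__name__)

class LogContext:
    """Accumulate related fields across several steps, then flush once."""

    def __init__(self, **initial):
        self.fields = dict(initial)

    def add(self, **fields):
        self.fields.update(fields)

    def flush(self, msg, level=logging.INFO):
        # One structured entry instead of many near-duplicate lines.
        logger.log(level, json.dumps({"msg": msg, **self.fields}))

# Usage: gather context across a unit of work, emit a single entry.
ctx = LogContext(task_type="crawler", crawler_id="hypothetical-id")
ctx.add(pages_fetched=42)
ctx.add(parse_errors=1)
ctx.flush("crawl finished")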

What To Log / Log Levels

Python's logging module defines five standard log levels.

Use the one that is appropriate for each use case; don't simply log everything as info.

  • debug for development logging only, such as raw requests/responses, granular state-change logging, etc.
  • info for success scenarios, or when something interesting and expected is happening
  • warning when something unexpected happens; not quite an error, but important to pay attention to
  • error for failures that might be recoverable, such as exceptions that can be handled safely
  • critical as a last resort, for doomsday scenarios such as the application being totally unusable

For example, a stack trace should never be printed to stdout or logged as info; it belongs with an error-level entry.
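
Here's a sketch of appropriate level choices; submit_payment, PaymentError, and the order fields are hypothetical stand-ins for real domain code:

import logging

logger = logging.getLogger(__name__)

class PaymentError(Exception):
    """Hypothetical domain error, for illustration."""

def submit_payment(order):
    raise PaymentError("gateway timeout")  # stub so the example is runnable

def charge_customer(order):
    logger.debug("raw order payload: %r", order)  # development-only detail
    try:
        result = submit_payment(order)
    except PaymentError:
        # logger.exception logs at ERROR level and attaches the stack trace,
        # so tracebacks are never print()ed or logged as info.
        logger.exception("payment failed for order %s", order["id"])
        raise
    if getattr(result, "retried", False):
        logger.warning("payment succeeded after retry for order %s", order["id"])
    else:
        logger.info("payment succeeded for order %s", order["id"])
    return result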

Our Auditing / Monitoring Policy

We must ensure that logs include important information for auditing purposes (when, where, who, what).

What to log?

Per veda's logging and monitoring policy, audit records should include (with some examples):

  • Date/time
  • Unique user ID
  • Data subject ID
    • entity ID
    • activity ID
    • tenant ID
    • database record ID
  • Filename or location accessed
    • AWS ARN
    • S3 URI
    • Database name
    • Service URI
  • Program or command used
    • lambda name
    • service name
    • module name
    • SDK name
  • Action performed
    • module method
    • SDK method
    • CRUD action
  • Success/Failure of Action Performed
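
For illustration, a single structured audit entry covering the when/where/who/what fields above might look like this (every value is made up):

import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger(__name__)

audit_record = {
    "timestamp": datetime.now(timezone.utc).isoformat(),  # Date/time
    "user_id": "user-1234",                               # Unique user ID
    "tenant_id": "tenant-42",                             # Data subject ID
    "location": "s3://example-bucket/reports/2025.csv",   # Filename or location accessed
    "program": "report-export-lambda",                    # Program or command used
    "action": "read",                                     # Action performed
    "outcome": "success",                                 # Success/Failure of action
}
logger.info(json.dumps(audit_record))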

OWASP Logging Guidelines

What to log?

The OWASP logging cheatsheet suggests the following should be logged whenever possible:

  • Input validation failures e.g. protocol violations, unacceptable encodings, invalid parameter names and values

  • Output validation failures e.g. database record set mismatch, invalid data encoding

  • Authentication successes and failures

  • Authorization (access control) failures

  • Session management failures e.g. cookie session identification value modification

  • Application errors and system events e.g. syntax and runtime errors, connectivity problems, performance issues, third party service error messages, file system errors, file upload virus detection, configuration changes

  • Application and related systems start-ups and shut-downs, and logging initialization (starting, stopping or pausing)

  • Use of higher-risk functionality e.g. network connections, addition or deletion of users, changes to privileges, assigning users to tokens, adding or deleting tokens, use of systems administrative privileges, access by application administrators, all actions by users with administrative privileges, access to payment cardholder data, use of data encrypting keys, key changes, creation and deletion of system-level objects, data import and export including screen-based reports, submission of user-generated content - especially file uploads

  • Legal and other opt-ins e.g. permissions for mobile phone capabilities, terms of use, terms & conditions, personal data usage consent, permission to receive marketing communications

  • Optionally consider if the following events can be logged and whether it is desirable information:

    • Sequencing failure
    • Excessive use
    • Data changes
    • Fraud and other criminal activities
    • Suspicious, unacceptable or unexpected behavior
    • Modifications to configuration
    • Application code file and/or memory changes

What NOT to log?

The following should not usually be recorded directly in the logs, but instead should be removed, masked, sanitized, hashed or encrypted:

  • Application source code
  • Session identification values (consider replacing with a hashed value if needed to track session specific events)
  • Access tokens
  • Sensitive personal data and some forms of personally identifiable information (PII) e.g. health, government identifiers, vulnerable people
  • Authentication passwords
  • Database connection strings
  • Encryption keys and other master secrets
  • Bank account or payment card holder data
  • Data of a higher security classification than the logging system is allowed to store
  • Commercially-sensitive information
  • Information it is illegal to collect in the relevant jurisdictions
  • Information a user has opted out of collection, or not consented to e.g. use of do not track, or where consent to collect has expired

Sometimes the following data can also exist, and whilst useful for subsequent investigation, it may also need to be treated in some special manner before the event is recorded:

  • File paths
  • Database connection strings
  • Internal network names and addresses
  • Non-sensitive personal data (e.g. personal names, telephone numbers, email addresses)

Enforcement of Logging / KPIs

Code reviews and tests are equally important to enforcing logging standards.

Domain Probes can be used to write better tests, especially when used with structured logging.

Structured logging, when paired with Domain Probes, should lead to logging guarantees as part of our normal unit tests. Since all important business logic methods would have clearly defined probes, logging is present and can be tested too, naturally.

If this is done correctly, we can ensure that our logs are tightly coupled to the relevant business logic, and we can easily prove that logging works as expected by testing it alongside that logic as part of our normal practices; this isn't easily doable with simple print() statements.

For example, if you want metrics, it's easy to hook into the data generated by domain probes and use the structured logs to derive the needed metrics. Doing this also allows us to write tests that prove we are hitting our metrics/KPIs, as shown in the sketch below.
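
Here's a sketch of what a logging guarantee in a unit test might look like; register_user and its log fields are hypothetical stand-ins for real domain code behind a probe:

import json
import logging
import unittest

def register_user(email):
    # Stand-in for business logic that logs via a domain probe.
    logging.getLogger("accounts").info(
        json.dumps({"msg": "user registered", "action": "create", "outcome": "success"})
    )

class RegistrationLoggingTest(unittest.TestCase):
    def test_registration_emits_structured_log(self):
        with self.assertLogs("accounts", level="INFO") as captured:
            register_user("user@example.com")
        entry = json.loads(captured.records[0].getMessage())
        self.assertEqual(entry["action"], "create")
        self.assertEqual(entry["outcome"], "success")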

Implementation

Logger Namespacing in Python

Namespace all logger objects, so they can be aggregated, filtered, and understood more easily.

In Python, simply pass __name__ as the logger name at the top of your module, like so: logging.getLogger(__name__)

If this is done, all logs generated from that module will include the module name by default.

Details

In a distributed environment, where many logs are aggregated into one stream, context may be lost if one isn't careful with log formatting.

When logging in a more traditional, naive way (for example, printing lines to a file or stdout), it's not possible, by default, to tell the source of a particular log entry unless we add it to the output manually. That's not ideal: it imposes a frustrating cognitive burden when reading and writing the code, and it leads to a lot of superfluous typing.

Here's an example of the default logging behavior that is often used in Python:

import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

This isn't ideal, for a few reasons:

First, it's not good practice for a module to set its own log level; leave that to the caller. Making logging dynamically configurable is the point of using the logging library in the first place, and the caller of the module should be the one controlling the logging.

Second, the logger is unnamed (no argument passed to getLogger). Python provides dunder attributes like __name__ that hold the module name, or a definition's name (function, class, method, etc.), so naming loggers costs nothing.
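
The corrected version is short; the module names its logger and leaves level configuration to the application entry point:

import logging

# Named per-module logger; no level is set here.
logger = logging.getLogger(__name__)

# Levels and handlers are configured once, by the caller, e.g. in main():
# logging.basicConfig(level=logging.INFO)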

Here's a decent example of a real veda log entry that has a logger name set:

INFO:crawlers.anthem.anthem_utils:{"msg": "Could not parse state None", "domain": {"task_type": "crawler", "domain_id": "bnlzZWRfZG9jdG9y", "swarm_id": "27b8a12b-8559-46ee-8de4-047bc4d85d0f", "crawler_id": "9db70359-b678-4a2f-935a-e597b97a1f58"}}
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     this is the logger name, populated from `__name__`, like so: `logger = logging.getLogger(__name__)`

...but it could still be improved. For example, it's not clear which method in anthem.anthem_utils is being invoked. Further, this log looks like a failure ("Could not parse state None"), yet it's logged as INFO, and it's also missing a timestamp, the action performed, and the outcome of that action.
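
An improved version of that entry might look like the following; the level choice, timestamp, method name, and the action/outcome fields are illustrative, not from a real system:

WARNING:crawlers.anthem.anthem_utils:{"msg": "Could not parse state None", "timestamp": "2025-06-10T16:53:00Z", "method": "parse_state", "action": "parse_state", "outcome": "failure", "domain": {"task_type": "crawler", "domain_id": "bnlzZWRfZG9jdG9y", "swarm_id": "27b8a12b-8559-46ee-8de4-047bc4d85d0f", "crawler_id": "9db70359-b678-4a2f-935a-e597b97a1f58"}}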

Structured Logging Libraries and Tooling

The Python standard library has all the pieces needed to implement basic structured logging. See this cookbook for more details on how to roll your own. Another built-in option is a custom LogRecord factory (logging.setLogRecordFactory).

Otherwise, structlog is a great third-party library, worth drawing inspiration from at minimum, that makes working with structured logging very easy.
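
For example, a minimal structlog setup that emits timestamped JSON might look like this (the bound fields and event names are illustrative):

import structlog

structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),  # render the event dict as JSON
    ]
)

# bind() attaches context once; it rides along on every subsequent entry.
log = structlog.get_logger().bind(task_type="crawler", crawler_id="hypothetical-id")
log.warning("state_parse_failed", state=None, outcome="failure")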

Logging For Observability

Domain Oriented Observability is an important concept to understand in the context of logging.

Domain-observable logs should be oriented around business logic; for example, can you derive KPI metrics from your logs? Often, as engineers, we add very low-level, technical logs to our applications, but that level of data may not be useful to all the systems actually consuming our logs.

That's not to say diagnostic logs are unimportant; low-level logs should definitely be used where they make sense, but we should also log higher-level business/domain events.

One good way to know if your logs are cohesive is to try to parse only the logs, and see if you can figure out everything that happened.

Domain Probes

A Domain Probe presents a high-level instrumentation API that is oriented around domain semantics, encapsulating the low-level instrumentation plumbing required to achieve Domain-Oriented Observability.

In a well designed system, there should already be natural logging points for domain probes, such as when certain class methods are called.
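
Here's a sketch of what a Domain Probe might look like; the discount domain and all names are illustrative, loosely following the Domain-Oriented Observability article:

import json
import logging

class DiscountProbe:
    """Domain-level instrumentation API that hides the logging plumbing."""

    def __init__(self):
        self._logger = logging.getLogger(__name__)

    def discount_applied(self, order_id, amount):
        self._emit("discount applied", action="apply_discount",
                   outcome="success", order_id=order_id, amount=amount)

    def discount_rejected(self, order_id, reason):
        self._emit("discount rejected", action="apply_discount",
                   outcome="failure", order_id=order_id, reason=reason)

    def _emit(self, msg, **fields):
        # One structured entry per domain event.
        self._logger.info(json.dumps({"msg": msg, **fields}))

# Business logic calls the probe at its natural domain events, e.g.:
#     probe.discount_applied(order_id="order-1", amount=5.0)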

This probe design also gives us a very obvious place to expect logging to be present, which should be enforced in code reviews, as discussed in the enforcement section above.
