@lmolkova
Last active July 13, 2021 05:02

Problem

The same operation may be instrumented multiple times (manual + auto, or library-native + auto) for several reasons, listed below. This usually affects protocol instrumentation (HTTP, gRPC, etc.), because these protocols are popular and already auto-instrumented. It manifests as duplicated spans that compete for context injection on RPC calls, doubled performance overhead, and higher telemetry bills. Here are some cases where it happens:

  1. specialized instrumentation: native library instrumentation can provide richer data and better quality/performance, while auto-instrumentation that's always on covers the same calls
  2. new library behavior (the user instrumented manually, and then a new version brings auto-instrumentation)
  3. configuration error

Case 1 is valid, while cases 2 and 3 are not; even so, we would rather communicate the need to remove the extra instrumentation than duplicate data.
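
The duplication described above can be sketched with a toy model. This is not the OpenTelemetry API: `start_span`, `inject`, and both `*_instrumented_send` functions are hypothetical stand-ins, and the only real-world detail borrowed is the W3C `traceparent` header layout. The sketch assumes a manually instrumented call that wraps an auto-instrumented one.

```python
import secrets

spans = []  # every span "exported" by this toy tracer


def start_span(name, trace_id, parent_id=None):
    # Hypothetical span factory: records a CLIENT span as a plain dict.
    span = {
        "name": name,
        "trace_id": trace_id,
        "span_id": secrets.token_hex(8),
        "parent_id": parent_id,
    }
    spans.append(span)
    return span


def inject(span, headers):
    # W3C traceparent layout: version-traceid-spanid-flags
    headers["traceparent"] = f"00-{span['trace_id']}-{span['span_id']}-01"


def send(headers):
    # Stand-in for the wire: returns the headers the server would see.
    return dict(headers)


def auto_instrumented_send(headers, trace_id, parent_id=None):
    # Layer added by auto-instrumentation of the HTTP client.
    span = start_span("HTTP GET (auto)", trace_id, parent_id)
    inject(span, headers)  # overwrites whatever the outer layer injected
    return send(headers)


def manually_instrumented_send(headers, trace_id):
    # Layer added by the user (or by the library itself) for the same call.
    span = start_span("HTTP GET (manual)", trace_id)
    inject(span, headers)
    return auto_instrumented_send(headers, trace_id, parent_id=span["span_id"])


trace_id = secrets.token_hex(16)
seen_by_server = manually_instrumented_send({}, trace_id)
```

One request produces two spans with the same semantics, and only the inner span's id reaches the server, so the outer span's injection work is wasted.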

Assumptions

Example:

  • there may be a logical client span that represents a composite call to an HTTP client (e.g. with internal retry/redirect logic) or some high-level operation.
  • there are RPC spans that represent individual RPC call tries (which may be redirects, etc.)
  • maybe someday there will be more: underlying DNS spans, content stream reading, individual streaming messages, etc.

Here are the assumptions:

  1. different layers of instrumentation MUST have different semantics:
    • logical high-level calls wrap multiple RPC calls and can't have protocol-level status code among other things
    • RPC calls are the ones with URLs, status codes, etc
    • DNS/streaming calls/content reading have yet other semantics
  2. All of the above spans may be CLIENT spans
  3. we don't know what's important to users, and they would give different answers, i.e. we can't assign a verbosity level to a layer
    • logical calls are important - some users don't care much how the client library/HTTP client behaves internally. At the least, they prioritize logical calls above lower-level details to reduce the bill.
    • RPC calls are important too - most client libraries are thin and just proxy RPC calls, so logical calls in this case are not very useful
    • with streaming APIs, you have one long request and you're more interested in messages that your apps exchange within it than the overall request span
  4. Terminal instrumentation marker is too narrow and only solves recursive telemetry reporting issues, e.g.:
    • if a logical call declared itself terminal, nobody would propagate context downstream
    • if RPC calls declared themselves terminal, future DNS/content-reading spans would not be traceable
  5. It seems we should allow multiple different kinds of instrumentation to coexist, but there MUST be at most one layer of each kind.
  6. Duplication and multiple layers don't necessarily happen on every request (HTTP requests instrumented within client libraries are only a subset of all the requests an app makes), so process-wide configuration-based suppression doesn't work.

Ideas

  • We need a clear separation of concerns for instrumentation: e.g. logical calls (even when they are composite HTTP-client calls with potential retry handling) are NOT HTTP. They do NOT inject context into RPC requests. We may end up with different rules, but we need clarity and consistency across multiple clients/languages/usage patterns.
  • Users/distros may opt out of certain levels of instrumentation? Fully, or once per process?
    • Context propagation MUST work anyway (i.e. if HTTP instrumentation is turned off, but a parent context exists and is valid, the instrumentation propagates the current context as a best effort)
  • Idea: a context marker prevents extra layers of the same instrumentation from being created:
    • each instrumentation has a type in addition to a schema URL; the type is not version-specific
    • the instrumentation type is exposed on either 1) spans or 2) context
    • an instrumentation can't create a span with an instrumentation type that is already present; the SDK can take care of this, e.g. by returning a noop span.