With reference to prometheus/prometheus#9139 and my comment there (prometheus/prometheus#9139 (comment)):
Using a non-inverted index for out-of-line storage of long and/or high-churn Prometheus series labels
To solve the various problems around identifying string-valued labels and differentiating them from normal labels in the TSDB's inverted indexes, string-valued labels would not be supported in normal selectors; `foo{string_label_x=~"a.+"}` would find zero results if `string_label_x` was a special out-of-line stored string label. Instead:
- On ingestion, meta-labels would be added by the client application or by rewrite rules to tell Prometheus to store specific label values out-of-line; these would be stored in the out-of-line label value index, but not added to the TSDB's inverted indexes. During the conventional TSDB update, the appender would treat them as if a `labeldrop` had been performed. E.g. `foo{job="a",instance="b",...,stringlabelname="value"}` could be supplied by the client with an additional `__label_storage_external="stringlabelname"` label, or this could be added during a rewrite pass. In this case (see the appender sketch after this list):
  - the appender would omit `stringlabelname="value"` from the series `foo{job="a",instance="b",...}` it appends to the TSDB; then
  - the appender would pass the series with its identifying labels, value and timestamp to an external storage appender handler, along with the external label tuple `stringlabelname="value"`; this handler would be responsible for storing the string-valued labels for the new series datapoint.
- On query, new PromQL functions would be used to fetch these labels and their values explicitly on request, like:
  - `fetch_string_attr(foo{}, "stringlabelname")` -> `foo{stringlabelname="value"}`
  - `fetch_string_attrs(foo{})` -> `foo{stringlabelname1="value", otherstringlabel="value2", ...}`
  - `filter_string_attr(foo{}, "stringlabelname", "filter.*pattern" [, bool addlabel])` -> the `foo{}` series satisfying `stringlabelname=~"filter.*pattern"`, optionally emitting `foo{stringlabelname="filterblahpattern"}` with the label added to the result if the optional bool param to retain the label value is true.

  PromQL would not "know" that `stringlabelname` is special, so the selector `foo{stringlabelname=~".+"}` would match nothing, as there are no entries in the TSDB inverted index for `stringlabelname`.
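As a rough illustration of the ingestion-side split described in the first bullet, here's a minimal Go sketch. The `Label` type, the `tsdbAppend`/`externAppend` callbacks and the comma-separated `__label_storage_external` value are all assumptions for illustration, not the actual Prometheus appender API.

```go
// Hypothetical sketch of the ingestion-side split. Label and the two append
// callbacks are stand-ins, not the real Prometheus appender interfaces.
package main

import (
	"fmt"
	"strings"
)

type Label struct{ Name, Value string }

// appendSample splits out the labels named in __label_storage_external,
// appends the reduced series to the TSDB, and hands the stripped tuples
// plus the series' identifying labels to the external label store.
func appendSample(lbls []Label, ts int64, v float64,
	tsdbAppend func(lbls []Label, ts int64, v float64) error,
	externAppend func(identifying []Label, ts int64, external []Label) error,
) error {
	var externalNames map[string]bool
	for _, l := range lbls {
		if l.Name == "__label_storage_external" {
			externalNames = map[string]bool{}
			for _, n := range strings.Split(l.Value, ",") {
				externalNames[n] = true
			}
		}
	}

	var identifying, external []Label
	for _, l := range lbls {
		switch {
		case l.Name == "__label_storage_external":
			// Meta-label: consumed here, never stored anywhere.
		case externalNames[l.Name]:
			external = append(external, l)
		default:
			identifying = append(identifying, l)
		}
	}

	// The conventional TSDB append sees only the identifying labels,
	// as if a labeldrop had removed the string-valued labels.
	if err := tsdbAppend(identifying, ts, v); err != nil {
		return err
	}
	if len(external) > 0 {
		return externAppend(identifying, ts, external)
	}
	return nil
}

func main() {
	lbls := []Label{
		{"__name__", "foo"}, {"job", "a"}, {"instance", "b"},
		{"stringlabelname", "value"},
		{"__label_storage_external", "stringlabelname"},
	}
	_ = appendSample(lbls, 1755643053, 1,
		func(l []Label, ts int64, v float64) error { fmt.Println("tsdb:", l); return nil },
		func(id []Label, ts int64, ex []Label) error { fmt.Println("extern:", id, ex); return nil },
	)
}
```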
Using PromQL functions would externalize most of the complexity of querying and injecting these labels into a well-separated module, minimising the impact on the appender and nearly eliminating the impact on the PromQL engine. They would also mean the PromQL executor needs no way to track, at query time, which labels are normal and which are special external ones.
And there's no real harm in having to fetch the whole selector from the TSDB before enriching it with the external labels, since that would be necessary for any feasible implementation anyway.
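To illustrate how such a function could stay outside the PromQL engine proper, here's a minimal sketch of the enrichment step behind something like `fetch_string_attr`, operating on series already fetched from the TSDB. `Sample`, `ExternalStore` and `Lookup` are hypothetical stand-ins, not real Prometheus or PromQL types.

```go
// Hypothetical sketch of query-time enrichment for a fetch_string_attr-style
// function. None of these types exist in the real PromQL engine.
package sketch

type Label struct{ Name, Value string }

type Sample struct {
	Labels []Label // identifying labels, as stored in the TSDB
	T      int64
	V      float64
}

// ExternalStore answers "what was the value of string label `name` for the
// series with these identifying labels, at or before timestamp t?".
type ExternalStore interface {
	Lookup(identifying []Label, name string, t int64) (string, bool)
}

// fetchStringAttr copies the input vector, attaching the requested external
// label to every sample for which the store has a value. It never consults
// the TSDB inverted index, so ordinary selectors stay unaware of the label.
func fetchStringAttr(in []Sample, name string, store ExternalStore) []Sample {
	out := make([]Sample, 0, len(in))
	for _, s := range in {
		if v, ok := store.Lookup(s.Labels, name, s.T); ok {
			s.Labels = append(append([]Label{}, s.Labels...), Label{name, v})
		}
		out = append(out, s)
	}
	return out
}
```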
- Executing `fetch_string_attr(foo{identifying_label="bar"}, "string_value_label")` to return `foo{identifying_label="bar",string_value_label="long, highly varying string here"}` should be reasonably fast, without unduly bloating the TSDB's inverted indexes and related in-memory structures.
- Fetching of string-valued labels could be limited to when they're explicitly requested in a query, limiting memory bloat at query time.
- String-valued labels could be matched and immediately discarded when being used as a filter, further limiting memory requirements.
- Prometheus currently has no way to represent a string value for a series, so a range-vector lookup of one series could produce multiple range-vectors as output, with one series per distinct value of the string-valued label(s) (possibly a lot of them if there are many values). This would duplicate all the identifying labels in memory and create a set of sparse range vectors, resulting in query-time memory consumption similar to a conventional info-metric based approach. Given that Prometheus currently has no resource limits or defenses of any kind against high query memory consumption caused by label-sets, this would greatly reduce the benefits offered by this feature.
- Searching for and matching series by their string-value data without a highly selective identifying-label filter to reduce the search space (e.g. `filter_string_attr(foo{}, "string_value_label", "blah.*")`) would be prohibitively expensive and slow for a wide result of `foo{}`, especially for anything except a prefix search on the values. It would only be practical for a narrower `foo{}`, or where there's a more selective `foo{normal_labels=~"...."}` to pre-filter it. Searching by string-valued labels would require an inefficient deep search of the external string-valued label storage index: the series would need to be loaded with their identifying labels from the inverted TSDB indexes as normal, essentially loading all of `foo{}`; then, for each series, the external label storage index(es) would be searched for the non-identifying labels, the patterns matched, and the series kept or discarded depending on the outcome (a sketch of this path follows this list). This would be rather slow for a wide set of series. But that's the price you'd have to pay for not blowing out Prometheus's TSDB cardinality by using normal labels to carry highly varying and possibly long label values.
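A minimal sketch of that expensive filter path, using the same kind of hypothetical stand-in types as above (none of this is the real Prometheus API): every series matching the selector is loaded first, then each one costs an external-store lookup plus a regexp match, with no inverted index to narrow the search.

```go
// Hypothetical sketch of filter_string_attr without a selective pre-filter.
// Series and ExternalStore are illustrative stand-ins only.
package sketch

import "regexp"

type Label struct{ Name, Value string }

type Series struct {
	Labels []Label // identifying labels loaded from the conventional TSDB
	T      int64   // sample timestamp used for the external lookup
}

type ExternalStore interface {
	Lookup(identifying []Label, name string, t int64) (string, bool)
}

// filterStringAttr keeps only the series whose external label `name` matches
// `pattern` at the sample timestamp, optionally attaching the matched value.
// Cost is one external lookup plus one regexp match per candidate series:
// fine for a narrow selector, prohibitive for a wide foo{}.
func filterStringAttr(in []Series, name, pattern string, addLabel bool,
	store ExternalStore) ([]Series, error) {

	re, err := regexp.Compile("^(?:" + pattern + ")$") // PromQL-style full anchoring
	if err != nil {
		return nil, err
	}
	var out []Series
	for _, s := range in {
		v, ok := store.Lookup(s.Labels, name, s.T)
		if !ok || !re.MatchString(v) {
			continue // discard immediately; the value never reaches the result
		}
		if addLabel {
			s.Labels = append(append([]Label{}, s.Labels...), Label{name, v})
		}
		out = append(out, s)
	}
	return out, nil
}
```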
The key-search tree structure would be something like this:

    __label_tuple_digest=13edec49e
        job=foo
        instance=bar
        otheridentifyinglabel=value
        [... more identifying labels ...]
        __oldest_value_timestamp=1755643053
            stringvalue1="long value here"
            stringvalue2="something else"
        __oldest_value_timestamp=1755643057
            stringvalue1="updated long value here"
            stringvalue2="something else"
        ...
    __label_tuple_digest=238ad5ff00
        job=foo
        instance=baz
        otheridentifyinglabel=value
        [... data for a different series ...]
To satisfy a query like `filter_string_attr(somemetric{job="foo"}, "stringvalue1", ".+", true)`, which asks for all `somemetric{job="foo"}` in which `stringvalue1=~".+"` is satisfied in the external label set, with results to include the values of `stringvalue1` as labels:
- all series matching `somemetric{job="foo"}` would be fetched from the conventional TSDB; then, for each matching series:
  - a quick digest of the sorted identifying labels for the series would be computed and searched for;
  - the tree would be descended to re-check that the series labels match (i.e. it isn't a hash collision); this might be in one node or split across a few interior nodes;
  - the next level of the tree would be keyed by value-timestamp; the newest value-timestamp older than the series sample's timestamp would be fetched, or for a range query this operation would be repeated for each sample timestamp in the range vector;
  - on the final level of the tree, the string-value label names and label values would be fetched, and the values requested by the selector returned (sketched below).
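A minimal in-memory sketch of that descent, assuming the tree shape shown earlier (digest, then identifying labels, then `__oldest_value_timestamp`, then values). The types, the FNV digest and the plain maps/slices are illustrative stand-ins; a real implementation would be an on-disk structure, as discussed below.

```go
// Hypothetical in-memory model of the key-search tree and its lookup descent:
// digest -> label re-check -> newest-not-newer timestamp -> string values.
package sketch

import (
	"fmt"
	"hash/fnv"
	"sort"
)

type Label struct{ Name, Value string }

// One versioned set of external string labels, valid from OldestTS onwards.
type externalVersion struct {
	OldestTS int64             // __oldest_value_timestamp
	Values   map[string]string // stringvalue1 = "long value here", ...
}

type externalSeries struct {
	Identifying []Label           // job, instance, ... sorted by name
	Versions    []externalVersion // sorted by OldestTS ascending
}

type externalIndex struct {
	byDigest map[uint64][]*externalSeries // keyed by __label_tuple_digest
}

// digest hashes the sorted identifying labels of a series.
func digest(lbls []Label) uint64 {
	ls := append([]Label(nil), lbls...)
	sort.Slice(ls, func(i, j int) bool { return ls[i].Name < ls[j].Name })
	h := fnv.New64a()
	for _, l := range ls {
		fmt.Fprintf(h, "%s\xff%s\xff", l.Name, l.Value)
	}
	return h.Sum64()
}

// lookup returns the value of external label `name` for the series with these
// identifying labels, as of timestamp t (newest version not newer than t).
func (ix *externalIndex) lookup(identifying []Label, name string, t int64) (string, bool) {
	for _, s := range ix.byDigest[digest(identifying)] {
		if !labelsEqual(s.Identifying, identifying) {
			continue // hash collision: the actual labels don't match
		}
		// Newest version whose OldestTS is <= t.
		i := sort.Search(len(s.Versions), func(i int) bool {
			return s.Versions[i].OldestTS > t
		}) - 1
		if i < 0 {
			return "", false // no version old enough for this sample
		}
		v, ok := s.Versions[i].Values[name]
		return v, ok
	}
	return "", false
}

func labelsEqual(a, b []Label) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}
```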
Key order in the tree would need to be subject to performance testing and analysis to see what the most effective descent order would be. It would probably not be possible for users to define their own indexes and key orders; if they could, Prometheus would then need some kind of stats-driven query planner to decide which to pick, and that's way too big a job.
There are various complex trade-offs in performance and storage with different tree structures and orders; e.g. whether the tree should use the string-label name as the top key (resulting in much faster searches for a specific string-label, but much slower searches for all string-labels for a series), or whether the identifying-labels digest should be the key.
The label-set to be included in the identifying-labels digest could also be specified as an ingestion-time meta-label, then stored as a label for the series in the normal TSDB so it could be consulted when searching the external labels tree.
In an actual storage implementation this deep tree might be indirected for efficiency, with the B+tree pointing to pages in a heap containing the extern'd values rather than the values themselves. And the on-disk data structure might not be a B+tree at all.
Thinking it through, I found some problems that convince me that, even if the TSDB storage issues are disregarded, there are too many other parts of Prometheus that cannot gracefully cope with series that have highly varying, potentially long-valued labels.
Let's say this approach allowed highly varying string values to be stored and fetched efficiently, without the consequences of high label churn, long values and high cardinality on the TSDB and its indexes.
Prometheus still has no way to represent the results efficiently, nor any way to return a large result set without excessive working memory use.
Even if fetching `foo{}[1h]`, where `foo{}` has some long-valued, varying label, is made more efficient on disk and for index searches, Prometheus is going to OOM and crash too easily when accumulating a result to send to the client. The result is first going to be multiplied in memory into a wide set of sparse range-series `foo{valuelabel="a"}`, `foo{valuelabel="b"}`, ... when fetching the value-labels, then marshalled into a single response payload, which must also fit in memory at the same time as the whole exploded result-series, including all the duplicates of all the identifying label key/value pairs. I don't see it working well.
To make this work well:
- A means would be needed to represent a series in-memory with some of its labels pointing to a common set of labels shared between many series, so that `foo{}[1h]` with a string-value-label that changes every sample would not have to duplicate `job`, `instance`, and every other identifying label for every sample in which any value changes (see the sketch at the end of this comment).
- The query executor would really want to be able to split up execution into batches with results that are progressively integrated after unwanted labels are discarded, etc. That's not something PromQL's rather simple executor can do right now; it does not have a dynamic query planner like a relational database does.
- It would be desirable to be able to stream query results to the client in batched chunks, rather than accumulating them in memory and returning them as a single huge result message payload that can easily OOM Prometheus.
... and all of these would probably be really intrusive code changes.
But without those things, having efficient storage isn't going to help that much; Prometheus is just going to crash on nontrivial queries of the results.
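For the first point above, a minimal sketch of what a shared-label representation might look like. These are hypothetical types, not the current Prometheus data model.

```go
// Sketch of sharing identifying labels between enriched samples so that only
// the varying string-value label is stored per entry.
package sketch

type Label struct{ Name, Value string }

// sharedLabels is stored once per underlying TSDB series.
type sharedLabels struct {
	Labels []Label // job, instance, ... identifying labels
}

// enrichedSample carries only what differs between samples: the timestamp,
// the value and the string-valued label(s) looked up from external storage.
type enrichedSample struct {
	Shared *sharedLabels // pointer to the common identifying label set
	T      int64
	V      float64
	Extern []Label // e.g. stringvalue1="long value here"
}
```

A result built this way would not duplicate `job`, `instance`, etc. for every distinct external-label value, unlike exploding `foo{}` into one sparse series per value, and streaming such batches to the client would avoid marshalling the whole exploded result in memory at once.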