Description
Is your feature request related to a problem? Please describe.
At AWS, we want to provide our customers and our operators visibility into the resource cost of queries. We believe there are two kinds of statistics:
- Semantic: statistics that semantically remain the same whether cached or not. For example, the number of series or samples that contributed to the result.
- Runtime: statistics about the specific query run. For example, the time the query actually took on this run, or the number of samples that were (or were not) served from cache.
Today, statistics are limited to timing. They're also only available for instant queries because of Cortex's range query caching system.
We specifically want to expose sample counts, series counts, and timing for both range and instant queries; we're working on the engine support for this upstream in Prometheus. However, we think it's entirely plausible that the community will find more useful statistics to expose, so we want to make that easy.
Describe the solution you'd like
Assuming our Prometheus work goes forward, the Prometheus engine will support recording sample and series stats, for either the query as a whole or by step in a range query. The statistics struct will be extensible.
Here, we propose integrating that Prometheus work into Cortex and exposing the new statistics for both instant and range queries. We'd like the extent cache to record stats by step, so that partially-cached queries can still report correct semantic statistics. We'll also expose timing statistics for range queries. We want adding a new kind of statistic in Cortex to be as easy as possible, and ideally we'll confirm that by identifying and adding another useful stat as a part of this work.
Describe alternatives you've considered
We considered implementing this purely in Prometheus. However, a partially-cached query can only report correct semantic statistics if the cached extents carry those statistics, so we think Cortex needs to change too.
Additional context
prometheus/prometheus#10181 is the corresponding Prometheus issue.
A question for maintainers and the community: are there other statistics that would be useful for us to capture in Cortex specifically?