Description
Hi 👋
A month ago @tomwilkie merged a PR that makes `query-frontend` capable of caching responses for queries against any Prometheus API. Details were presented at the Prometheus London Meetup:
- Slides: https://speakerdeck.com/grafana/blazin-fast-promql
- Watch the talk here: https://youtu.be/eyBbImSDOrI
Now, this is an amazing piece of work, as it allows simple and clear Cortex response caching (with splitting by day!) to be used against any Prometheus-based backend. Requests against metric backends are often expensive, have small result output, and tend to be simultaneous and repetitive, so it makes sense to treat such a caching component as a must-have, even for vanilla Prometheus. As Thanos maintainers, we have been looking for exactly something like this for some time. Overall, it definitely looks like Cortex and Thanos are trying to solve a very similar problem.
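To illustrate the core idea (a minimal sketch, NOT the actual Cortex `query-frontend` code), splitting by day means cutting a long range query into day-aligned sub-queries whose responses can be cached and reused independently:

```go
// Minimal sketch of the day-splitting idea, not the actual Cortex code:
// a long range query is cut into UTC-day-aligned sub-queries whose
// responses can be cached and reused independently.
package main

import (
	"fmt"
	"time"
)

// queryRange is a simplified stand-in for a Prometheus range-query request.
type queryRange struct {
	Query      string
	Start, End time.Time
}

// splitByDay returns one sub-range per UTC day touched by q.
// time.Truncate(24h) aligns to UTC midnights in Go's time model.
func splitByDay(q queryRange) []queryRange {
	var out []queryRange
	for start := q.Start; start.Before(q.End); {
		next := start.Truncate(24 * time.Hour).Add(24 * time.Hour)
		if next.After(q.End) {
			next = q.End
		}
		out = append(out, queryRange{Query: q.Query, Start: start, End: next})
		start = next
	}
	return out
}

func main() {
	q := queryRange{
		Query: `rate(http_requests_total[5m])`,
		Start: time.Date(2019, 10, 1, 18, 0, 0, 0, time.UTC),
		End:   time.Date(2019, 10, 3, 6, 0, 0, 0, time.UTC),
	}
	for _, s := range splitByDay(q) {
		fmt.Println(s.Start, "→", s.End) // prints three day-aligned pieces
	}
}
```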
From the Thanos side, we want to make it the default caching solution that we recommend, document, and maintain.
However, such caching is currently heavily bound to Cortex: it sits next to quite a complex queuing engine, which it has already been proposed to extract from the caching logic. I believe that splitting caching into a separate project (`promcache`?), in some common org like https://github.com/prometheus-community, could have many advantages around contributing, clarity, and adoption. I enumerate some benefits further down.
Proposal
- Move `query-frontend` caching logic to a separate Go module (plus a cmd to run it), e.g. https://github.com/prometheus-community/promcache. The name of the project is to be defined (:
- Add maintainers who want to help from both Cortex and Thanos as the project owners.
- Make it clear that this is a caching project for Prometheus API, Cortex, and Thanos backends.
  - Open questions:
    - What if other backends want something extra? VictoriaMetrics, M3DB?
    - Should we embed retries and limits as well? (IMO yes)
- Allow Cortex to use it either as a library in `query-frontend` or just point to `query-frontend` (without caching)
- Allow Thanos to use it as a library in Querier (potentially) or spin it up on top of Querier (must-have); a purely hypothetical sketch of such a library interface follows this list
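To make the library option above concrete, here is a purely hypothetical sketch (every name, including the `Cache` interface and the constructor, is invented for illustration; nothing here is an existing API) of how the cache could be embedded as ordinary `http.Handler` middleware in front of any downstream Prometheus-compatible API:

```go
// Purely hypothetical sketch of a promcache library surface (every name
// here is invented for illustration): the cache is ordinary http.Handler
// middleware, so Cortex could wrap its query-frontend with it and Thanos
// could wrap Querier, while "point to query-frontend without caching"
// just means skipping the wrapper.
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// Cache is the (hypothetical) response cache for Prometheus query APIs.
type Cache interface {
	// Wrap returns a handler that serves cached day-split responses and
	// falls through to next for the missing pieces.
	Wrap(next http.Handler) http.Handler
}

func main() {
	// Downstream can be vanilla Prometheus, Cortex, or a Thanos Querier.
	downstream, err := url.Parse("http://querier:9090")
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(downstream)

	var cache Cache // e.g. promcache.New(...) in this hypothetical API
	handler := http.Handler(proxy)
	if cache != nil {
		handler = cache.Wrap(handler) // embed caching as a library
	}
	log.Fatal(http.ListenAndServe(":9091", handler))
}
```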
If we agree on this, we (the Thanos team) are happy to spin this project up: prepare the repo, Go module, and initial docs, and extract the caching logic from `query-frontend`. Then we can focus on embedding caching in existing components like Querier or `query-frontend` and use `promcache` as a library if needed.
Benefits of moving the caching part of `query-frontend` into a separate project
- Share responsibility for maintaining `promcache` across both the Thanos and Cortex teams.
- A more focused project! (caching, retries, limits around the Prometheus query APIs)
- Easier to understand, easier collaboration, documentation, starting up
- Separate versioning
- Easier to use as a library (fewer deps)
- Easier to justify adjustments for Cortex & Thanos:
  - While some logic is common, there might be some separate changes required for Cortex and Thanos:
    - Cortex: QoS, queueing, multi-tenancy;
    - Thanos: splitting by ranges other than days when using downsampled data (see the sketch after this list), partial response logic, etc.
- The first step towards joining forces and collaboration between Cortex & Thanos!
  - Space to agree on a common queuing API inspired by Cortex that might be useful for Thanos or even vanilla Prometheus
  - Space to agree on multi-tenancy, QoS, retry, and limits mechanisms together ❤️
- Beneficial for Cortex itself:
  - Scaling the caching frontend separately from the queuing: Query Frontend scalability #1150
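To make the downsampling point above concrete: Thanos stores downsampled data at 5m and 1h resolutions, where sample density is much lower, so a fixed one-day split can produce needlessly tiny cache pieces. A minimal sketch of picking a split interval from the query resolution (the thresholds and values below are made-up examples, not existing Thanos logic):

```go
// Illustrative sketch (not existing Thanos code) of why Thanos may want
// split ranges other than a fixed day: with downsampled data the sample
// density drops, so wider cache-split intervals keep per-piece result
// sizes comparable. All threshold values below are made-up examples.
package main

import (
	"fmt"
	"time"
)

// splitInterval picks a cache-split interval from the query's maximum
// downsampling resolution (Thanos uses raw, 5m, and 1h resolutions).
func splitInterval(resolution time.Duration) time.Duration {
	switch {
	case resolution >= time.Hour: // 1h downsampled data
		return 7 * 24 * time.Hour // split by week (example value)
	case resolution >= 5*time.Minute: // 5m downsampled data
		return 2 * 24 * time.Hour // split by two days (example value)
	default: // raw data
		return 24 * time.Hour // the usual day splitting
	}
}

func main() {
	for _, res := range []time.Duration{0, 5 * time.Minute, time.Hour} {
		fmt.Printf("resolution %v → split by %v\n", res, splitInterval(res))
	}
}
```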
What could be missing in the current `query-frontend` caching layer?
- Client load balancing for the downstream API
  - E.g. in Kubernetes it's hard to load balance the Queriers equally (round-robin); see the sketch after this list
- Adjustments for Thanos as mentioned above.
- Caching other Prometheus APIs (label names/values, series)
- Other caching backends
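On the load-balancing point: with a regular ClusterIP Service plus HTTP keep-alive, a frontend tends to pin its long-lived connections to a few Queriers, so the client has to resolve the individual endpoints (e.g. via a headless Service) and rotate across them itself. A minimal sketch, assuming a hypothetical headless Service named `querier` in a `monitoring` namespace:

```go
// Minimal sketch of client-side round-robin over Querier endpoints.
// Assumption (hypothetical): a Kubernetes *headless* Service named
// "querier" in namespace "monitoring", whose DNS A records list one IP
// per ready Querier pod.
package main

import (
	"fmt"
	"net"
	"sync/atomic"
)

type roundRobin struct {
	addrs []string
	next  uint64
}

// pick returns the next backend address in round-robin order.
func (r *roundRobin) pick() string {
	n := atomic.AddUint64(&r.next, 1)
	return r.addrs[(n-1)%uint64(len(r.addrs))]
}

func main() {
	// Headless Services return one A record per ready pod.
	ips, err := net.LookupHost("querier.monitoring.svc.cluster.local")
	if err != nil {
		// Outside a cluster the lookup fails; fall back to example IPs.
		ips = []string{"10.0.0.1", "10.0.0.2", "10.0.0.3"}
	}
	rr := &roundRobin{addrs: ips}
	for i := 0; i < 4; i++ {
		// Each downstream request would go to the picked Querier.
		fmt.Println("sending query to", rr.pick()+":9090")
	}
}
```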
Thanks @gouthamve for the input so far!
cc @bboreham @tomwilkie and others (: What do you think?