Shared caching layer for thanos queriers #5047
Labels
component: query
difficulty: hard
dont-go-stale
feature request/improvement
GSoC/Community Bridge/LFX
Is your proposal related to a problem?
We are running about 10 Thanos querier replicas for scaling purposes, and we have 100+ edge clusters across the world, each running Prometheus with a Thanos sidecar.
For our setup, the fanout problem is huge because of the scale. For example:
Info requests to sidecars
This is not a big problem on its own because Info requests and responses are relatively cheap. Still, in our setup, (number of queriers x number of sidecars) requests are sent every time. That is fine when the scale is small, but as the number of Thanos queriers and edge sidecars grows, it becomes increasingly inefficient.
Metadata and rules query requests to sidecars
Metrics metadata and rules responses rarely change for us, especially metrics metadata. This is where caching would benefit us the most.
More use cases in the future
In #1611, we proposed a bloom-filter-like data structure to reduce unnecessary Series calls. Ideally, this could be done by reporting more data through the Info API and keeping a bloom filter in each querier. If we have a caching layer for the querier cluster, keeping the bloom filter up to date is no longer expensive.
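To make the bloom filter idea concrete, here is a minimal sketch in Go. Everything here is illustrative: the `bloom` type, its sizing, and the metric-name keying are assumptions, not existing Thanos APIs. The intent is that each store endpoint reports such a filter (e.g. over its metric names) via the Info API, and the querier skips Series calls to endpoints whose filter definitely does not contain the queried metric.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloom is a minimal Bloom filter. Hypothetical helper, not a Thanos type.
type bloom struct {
	bits []uint64
	k    int // number of hash functions
}

func newBloom(mBits, k int) *bloom {
	return &bloom{bits: make([]uint64, (mBits+63)/64), k: k}
}

// hashes derives k hash values from one FNV-1a hash via double hashing.
func (b *bloom) hashes(s string) []uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	h1 := h.Sum64()
	h2 := h1>>33 | h1<<31
	out := make([]uint64, b.k)
	for i := 0; i < b.k; i++ {
		out[i] = h1 + uint64(i)*h2
	}
	return out
}

func (b *bloom) Add(s string) {
	m := uint64(len(b.bits) * 64)
	for _, h := range b.hashes(s) {
		idx := h % m
		b.bits[idx/64] |= 1 << (idx % 64)
	}
}

// MayContain returns false only if s was definitely never added;
// true means "possibly present" (false positives are possible).
func (b *bloom) MayContain(s string) bool {
	m := uint64(len(b.bits) * 64)
	for _, h := range b.hashes(s) {
		idx := h % m
		if b.bits[idx/64]&(1<<(idx%64)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	// A sidecar would populate this with its metric names and report it
	// through the Info API; the querier checks it before calling Series.
	f := newBloom(1<<16, 4)
	f.Add("http_requests_total")
	fmt.Println(f.MayContain("http_requests_total")) // true
	fmt.Println(f.MayContain("unrelated_metric"))
}
```

A negative answer lets the querier drop that endpoint from the fanout entirely, which is exactly where a shared cache helps: the filters can be refreshed once for the whole querier cluster instead of once per replica.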
Describe the solution you'd like
Have another type of cache for this use case, maybe called a proxy cache? It is similar to the caching bucket, but this time we cache endpoint responses. I also think the new galaxycache is very suitable for this use case.
Describe alternatives you've considered
Use some kind of gRPC proxy that does caching/passthrough based on the request. I haven't investigated this yet, but maybe something that already exists suits this use case.