
Investigate label compression #5870

Open
GiedriusS opened this issue Nov 7, 2022 · 4 comments

@GiedriusS
Member

GiedriusS commented Nov 7, 2022

Is your proposal related to a problem?

We currently send label names/values as bare strings in each Series() call. Even with gRPC compression turned on, compression is applied to each streamed response individually, not to the stream as a whole:

The compression supported by gRPC acts at the individual message level, taking message as defined in the wire format document.

(https://grpc.github.io/grpc/core/md_doc_compression.html)

Parca uses a deduplicated string table for compression (parca-dev/parca#1976) and got good results, so perhaps we could use the same idea. Prometheus TSDB also uses the same idea: it interns all strings so they are not repeated: https://github.com/prometheus/prometheus/blob/main/tsdb/docs/format/index.md#symbol-table.

Describe the solution you'd like

Create a lookup table while sending back Series responses. Send the lookup table to the client at the end.

Note that because this kind of compression applies to the whole response, it would no longer be possible to make a fully streamed Select() call for the PromQL engine. An alternative would be to send the lookup table incrementally rather than at the end of the whole stream, but then there would be more round trips.

Describe alternatives you've considered

N/A

Additional context

https://cloud-native.slack.com/archives/CL25937SP/p1667349702590129

@GiedriusS
Member Author

GiedriusS commented Nov 7, 2022

Did a small experiment on the querier's side with a typical user query that we get here. Only a very small set of hostnames and pod names are unique; all of the other values repeat more than once. Some strings repeat 8k or even 12k times 😱

Added up the total size of all strings in that query, and the size counting each unique string only once. Here are some results:

Total sum is 1179630, total minimal is 26431
Total sum is 1468650, total minimal is 32283
Total sum is 1601698, total minimal is 33521
Total sum is 1331910, total minimal is 28607
Total sum is 1609175, total minimal is 33692
Total sum is 1307388, total minimal is 27434                                                                                                                                                               

(https://gist.github.com/GiedriusS/aada711443326ea452ec5c4c0c508b07)

In this test, chunk data takes up around ~130KB. This means that for each StoreAPI, we can save around ~97% of the traffic used just for sending labels. In total, the reduction would be around 90%. I tested with an instant query. I guess that with a range query and constant labels this gain would be smaller because more chunk data would need to be sent.

The only caveat I can think of is that we either won't be able to have a streaming Select(), or there would be more round trips if we were to stream the lookup table gradually as it is built up.

@yeya24
Contributor

yeya24 commented Nov 9, 2022

@GiedriusS I think this also applies to the Query Frontend side.
Right now, the query frontend sends gRPC requests to downstream queriers and merges the responses in a blocking way. We could deduplicate labels there.
It's just that the gain is not as large as for Select, because most of the query results are already aggregated.

Edit: actually, it seems we are not using the gRPC query API now. It is still the HTTP API. 😢 I am wondering if we have any plan to use that gRPC query API; right now it is not used anywhere.

@fpetkovski
Contributor

Dictionary encoding would definitely be an awesome addition. One thing that might not be as straightforward is proxying encoded series through a querier that is used as a gRPC proxy. The simplest way would be to decode and re-encode all series in this middleware querier using a union of all received maps. But that would require blocking and buffering. Maybe there is a way to make the encoding composable so that only the root querier has to do the decoding.

Regarding the gRPC Query API, I added this because I thought it would be needed for pushdown and sharding. But in the end, we were able to do everything using the existing HTTP API. It would be nice to start using the gRPC one, but if the migration is too complex or the API is hard to maintain, I would also be okay with deprecating it.

@GiedriusS
Member Author

Yeah, I was thinking that if there was a function like:

string_mapper(label name/value) -> signature

that would work identically across multiple StoreAPIs, we could compose responses from multiple StoreAPIs. An ordinary hash function would work, but the hashes could become longer than the original strings. Perhaps the function could leverage the fact that label names/values have strict requirements, i.e. they don't cover the full Unicode space. I have tried googling for a bit but haven't found a good function that could work here.
