Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

store: add initial symbol tables support #5906

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

GiedriusS
Copy link
Member

Add initial support for symbol tables; symbol tables are sent via hints that are used to deduplicate strings. The main benefit of doing this is to reduce network traffic && reduce number of allocations needed for all of the different strings.

Strings are referenced by their number. In total, there could be math.MaxUint64 strings. To avoid blocking in querier, symbol tables / references are automatically adjusted depending on which store it is in the stores slice. In other words, the whole math.MaxUint64 space is divided into equal number of strings for each StoreAPI. If that limit is breached then raw strings are sent instead. This is taken care of by the LookupTable builder. We ensure that old versions are compatible with this one by passing over the maximum number of allowed unique strings via StoreRequest. If that limit is zero then we disable the building of the lookup table.

This compression is beneficial in almost all cases. The worst case is when there are a lot of unique metrics with unique strings in each metric. However, I strongly believe that this will only happen 1% of the time due to the nature of monitoring. This is because a set of metrics will always have some identical dimensions i.e. same labels with only one or two changing.

I have also attempted an alternative implementation whereas a Label could be oneof compressed labels or the regular labels. However, the implementation wasn't quite as elegant, cumbersome. This is much cleaner.

For now, only streaming Sidecar + Thanos Store is implemented. We could add support for this in other components as a follow-up.

Signed-off-by: Giedrius Statkevičius giedrius.statkevicius@vinted.com

@yeya24
Copy link
Contributor

yeya24 commented Nov 18, 2022

Nice! Is there any benchmark for this?

@fpetkovski
Copy link
Contributor

I made a quick first pass, this looks pretty cool! I could not understand if the layered querier topology is covered as well. Is the table (indexing space) somehow recursively divided?

@saswatamcode
Copy link
Member

This looks really amazing! Would be interesting to see benchmarks too, especially with the lookup table bimap!

Add initial support for symbol tables; symbol tables are sent via hints
that are used to deduplicate strings. The main benefit of doing this is
to reduce network traffic && reduce number of allocations needed for all
of the different strings.

Strings are referenced by their number. In total, there could be
math.MaxUint64 strings. To avoid blocking in querier, symbol tables /
references are automatically adjusted depending on which store it is in
the stores slice. In other words, the whole math.MaxUint64 space is
divided into equal number of strings for each StoreAPI. If that limit is
breached then raw strings are sent instead. This is taken care of by the
LookupTable builder. We ensure that old versions are compatible with
this one by passing over the maximum number of allowed unique strings
via StoreRequest. If that limit is zero then we disable the building of
the lookup table.

This compression is beneficial in almost all cases. The worst case is
when there are a lot of unique metrics with unique strings in each
metric. However, I strongly believe that this will only happen 1% of
the time due to the nature of monitoring. This is because a set of
metrics will always have some identical dimensions i.e. same labels with
only one or two changing.

I have also attempted an alternative implementation whereas a `Label`
could be `oneof` compressed labels or the regular labels. However, the
implementation wasn't quite as elegant, cumbersome. This is much
cleaner.

For now, only streaming Sidecar + Thanos Store is implemented. We could
add support for this in other components as a follow-up.

Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>
@GiedriusS
Copy link
Member Author

GiedriusS commented Nov 21, 2022

👋

I made a quick first pass, this looks pretty cool! I could not understand if the layered querier topology is covered as well. Is the table (indexing space) somehow recursively divided?

Each upper layer divides the uint64 space into equal parts for each StoreAPI. If any StoreAPI exceeds that limit then it errors out. Given that string references are unique, I think the function that I'm using should always remap all IDs properly.

I also thought about adding this behind a flag but is it realistic to have thousands and thousands of StoreAPIs that give millions of unique strings? 🤔 math.MaxUint64 is 18446744073709551615 😄 I don't think it should even be reached. An even then maybe we could bump to uint128.

What benchmarks would you like to see for this? If this code looks good to my team mates then I'll try this out in prod and see the results 👁️

@GiedriusS GiedriusS marked this pull request as ready for review November 21, 2022 12:01
@fpetkovski
Copy link
Contributor

I agree that it's hard to have a benchmark since a lot of benefits will come from reducing network bandwidth. So anything we run on a single machine will not be representative.

@saswatamcode
Copy link
Member

What benchmarks would you like to see for this? If this code looks good to my team mates then I'll try this out in prod and see the results

I was thinking benchmarking the compression/decompression might be useful so that if this is changed in the future, we can catch regressions! 🙂
But yes, it won't capture the network bandwidth benefits, so trying in prod sounds better for that.

@GiedriusS
Copy link
Member Author

Here's how network traffic dropped after deploying this:
image

Dropped by 3x. This is one of the servers that is only evaluating rules via Thanos Ruler so the effect is visible the most.

@fpetkovski
Copy link
Contributor

Was there any reduction in query latency?

@GiedriusS
Copy link
Member Author

Was there any reduction in query latency?

Doesn't look so, maybe just a little bit :( but with much smaller network usage, it's possible to stuff more recording/alerting rules into it

@GiedriusS
Copy link
Member Author

RAM usage also dropped around ~20% with no noticeable increase in CPU usage 🤔

image
image

@bwplotka
Copy link
Member

cc @bboreham - want to check it? You might have some thoughts. (:

@bboreham
Copy link
Contributor

Thanks for this; I have considered doing something like this in the remote-write protocol.
A question that occurs to me is "when is a symbol table discarded from memory?", but perhaps I didn't understand.

Re some conversations we had at PromCon, you may like grafana/mimir-prometheus#353.

Copy link
Collaborator

@matej-g matej-g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice job, the implementation looks good to me and your prod test results seem persuasive 👀 , I have mostly small nits.

pkg/store/bucket.go Outdated Show resolved Hide resolved
pkg/store/bucket.go Outdated Show resolved Hide resolved
pkg/store/lookup_table.go Outdated Show resolved Hide resolved
pkg/store/lookup_table.go Outdated Show resolved Hide resolved
pkg/store/lookup_table.go Outdated Show resolved Hide resolved
pkg/store/bucket.go Outdated Show resolved Hide resolved
pkg/store/tsdb.go Outdated Show resolved Hide resolved
@matej-g
Copy link
Collaborator

matej-g commented Nov 22, 2022

FWIW, idea somewhat related idea to others in this conversation - reusing strings in receiver #5899 - did not manage to finalize it yet though

@cstyan
Copy link
Contributor

cstyan commented Nov 24, 2022

Thanks @bboreham sending me this way.

@GiedriusS I have a similar implementation of a symbols table, based originally on the interner we already use in remote write. Let me know if you'd like to contribute yours upstream so that it could be reused in remote write.

Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>
@stale stale bot added the stale label Jan 7, 2023
@GiedriusS GiedriusS removed the stale label Jan 9, 2023
@GiedriusS
Copy link
Member Author

Been running this for months, haven't spotted any problems. PTAL.

@fpetkovski
Copy link
Contributor

This looks good overall, I'll take another look next week to make sure we don't miss something.

pkg/query/querier.go Outdated Show resolved Hide resolved
pkg/store/storepb/rpc.proto Show resolved Hide resolved
It takes >17s. to decompress >80M elements. Doing that concurrently cuts
down the time to 3-4 seconds. We have lots of spare CPU so let's do
that.

Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>
Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>
Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>
Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>
…port

Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>
Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>
@GiedriusS
Copy link
Member Author

Updated PR - merged main && made changes according to comments. Also, while running this in prod I've noticed that decompression takes a long time with queries that retrieve millions of series hence I've added concurrent decompression support that is tunable via a new flag. Also, now this functionality is disabled unless that flag is set to a value bigger than zero.

I can't imagine running Thanos in prod without this anymore because Queriers would get easily overwhelmed - we have thousands of recording rules and the labels completely dominate the % of bandwidth.

Copy link
Contributor

@fpetkovski fpetkovski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Any recommendations on what value to set the number of workers to?

@yeya24
Copy link
Contributor

yeya24 commented Feb 2, 2023

How much does this feature help, any dashboards/metrics to share? @GiedriusS 👀

Updated: NVM I see the improvements!

Copy link
Collaborator

@matej-g matej-g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⭐ ⭐ ⭐

One more nit regarding flag from me, but I'll leave it up to you, the implementation looks good 🙌

@@ -111,6 +111,9 @@ func registerQuery(app *extkingpin.App) {
maxConcurrentSelects := cmd.Flag("query.max-concurrent-select", "Maximum number of select requests made concurrently per a query.").
Default("4").Int()

maxConcurrentDecompressWorkers := cmd.Flag("query.max-concurrent-decompress-workers", "Maximum number of workers spawned to decompress a set of compressed storepb.Series. Setting this to higher than zero enables label compression during querying - CPU usage will be slightly higher whilst network usage will be significantly lower.").
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This flag is a tiny bit confusing to me - it sounds like it configures the number of decompress workers, but in fact this also turns on / off the compression itself. It either sounds like two distinct flags would be better or somehow these configs should be combined, if it can be, in a reasonable way so that the flag name reflects it (but I struggle to find one).

Also I wonder if do not want to put this behind hidden flag straightway, a bit more documentation would be nice (but can be done in a separate PR).

@matej-g
Copy link
Collaborator

matej-g commented Mar 3, 2023

@GiedriusS anything still blocking this? Or you just need to resolve conflicts?

@GiedriusS
Copy link
Member Author

This will need to be reworked a bit after the latest changes because order became important now. It means that we need to somehow merge different symbol references from multiple StoreAPIs in such a way as to preserve order. I don't know how to do that, at the moment. The references will need to somehow be encoded in such a way as to preserve order.

@stale
Copy link

stale bot commented May 21, 2023

Hello 👋 Looks like there was no activity on this amazing PR for the last 30 days.
Do you mind updating us on the status? Is there anything we can help with? If you plan to still work on it, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next week, this issue will be closed (we can always reopen a PR if you get back to this!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label May 21, 2023
@stale
Copy link

stale bot commented Jun 18, 2023

Closing for now as promised, let us know if you need this to be reopened! 🤗

@stale stale bot closed this Jun 18, 2023
@yeya24
Copy link
Contributor

yeya24 commented Jun 18, 2023

Do we still want to continue this work?

@yeya24 yeya24 reopened this Jun 18, 2023
@MitchellJThomas
Copy link

+1 for continuing the work

@stale stale bot removed the stale label Sep 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants