queryfrontend 20% latency overhead #4571

bwplotka · 2021-08-17T10:21:21Z

We see that on first query range queries (cold cache) can take extra 20% of time if they go through query-frontend query range.

AC:

We understand the bottlenecks
We have micro benchmark for this path
Ideally: We fix latency. It should not be 20% on ~1m query IMO

cc @matej-g

GiedriusS · 2021-08-17T11:44:10Z

I was thinking that maybe our HTTP client isn't honoring keep-alive properly. I believe this is the relevant place where the request is handled: https://github.com/cortexproject/cortex/blob/master/pkg/frontend/transport/handler.go#L119. Isn't this missing a io.Copy(ioutil.Discard, r.Body) just like we do in other places (https://awmanoj.github.io/tech/2016/12/16/keep-alive-http-requests-in-golang/)?

GiedriusS · 2021-08-18T12:16:39Z

Yes, so that is the problem albeit in a different place + the default http.DefaultTransport is used which is quite limited in the number of idle connections: https://github.com/cortexproject/cortex/blob/master/pkg/frontend/downstream_roundtripper.go#L39. Let me try to fix this.

matej-g · 2021-08-18T14:48:24Z

@GiedriusS wow, impressively fast action 😁. Once this lands in Cortex I'll be happy to poke around some more and see how much of improvement we get over the latency mentioned in the description that we are currently observing on our side.

Make the HTTP downstream tripper configurable so that it would be possible to set the number of maximum idle connections per host to more than 2 so that HTTP keep-alive connections could be used on higher loads. Solves thanos-io#4571. Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>

* query-frontend: make HTTP downstream tripper configurable Make the HTTP downstream tripper configurable so that it would be possible to set the number of maximum idle connections per host to more than 2 so that HTTP keep-alive connections could be used on higher loads. Solves #4571. Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com> * CHANGELOG: add entry Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>

GiedriusS · 2021-09-06T10:48:39Z

This should be fixed with #4623 and setting max_idle_conns_per_host to a high number such as 100. @matej-g could you please try this out?

matej-g · 2021-09-13T08:07:42Z

Hey @GiedriusS,
So I did a number of 'macro' tests last week (comparable to our original latency report). I compared latest main with v0.22.0. I did tests both with the interactive example locally (while spinning up 2 different versions of query frontend) and on a test cluster on AWS. In all cases I used the same data, generated by the interactive example via thanosbench tool (this created ~2 weeks worth of data).
I also tested a number of different settings (with cache enabled / disabled; with split interval 24h / 1h / disabled). The conclusion is however that I could not see any discernible difference between the previous version and the version with configurable roundtripper (while using the recommended setting with 100 max idle connections per host). Queries took roughly the same time on both instance of query frontend in all cases.

However, one interesting finding is about the overhead of doing query splitting / 'stitching' it back together. Below I'm pasting some numbers and queries I ran, but it seems that on a 1 week range with the default 24h splitting, the response time can be as much as 2-3x higher in comparison to going directly through query (and the overhead seems to increase with the range).

EDIT: OK, I think this was caused by the fact that the interactive tests overrides the default for max concurrent queries to 1 which I wasn't aware of, which was causing the increase wait time for split queries.

GiedriusS · 2021-09-15T15:29:37Z

Hey @GiedriusS,
So I did a number of 'macro' tests last week (comparable to our original latency report). I compared latest main with v0.22.0. I did tests both with the interactive example locally (while spinning up 2 different versions of query frontend) and on a test cluster on AWS. In all cases I used the same data, generated by the interactive example via thanosbench tool (this created ~2 weeks worth of data).
I also tested a number of different settings (with cache enabled / disabled; with split interval 24h / 1h / disabled). The conclusion is however that I could not see any discernible difference between the previous version and the version with configurable roundtripper (while using the recommended setting with 100 max idle connections per host). Queries took roughly the same time on both instance of query frontend in all cases.

However, one interesting finding is about the overhead of doing query splitting / 'stitching' it back together. Below I'm pasting some numbers and queries I ran, but it seems that on a 1 week range with the default 24h splitting, the response time can be as much as 2-3x higher in comparison to going directly through query (and the overhead seems to increase with the range).

EDIT: OK, I think this was caused by the fact that the interactive tests overrides the default for max concurrent queries to 1 which I wasn't aware of, which was causing the increase wait time for split queries.

Mhm, is it reproducible when more maximum concurrent queries are permitted?

matej-g · 2021-09-16T14:51:27Z

It's not reproducible with higher number of concurrent queries (e.g. the default 4 or higher) - with that configuration, the times are pretty similar for both query and query frontend.

I am actually now unsure about how precise the original report we received about those 20% was and whether there might have been something else going on that would explain the difference. But from what I can conclude from my tests, there is no perceivable difference for the end user regardless of whether doing query through frontend or directly on query (with the obvious exception of the case when the operator sets query to handle only 1 concurrent query, which introduces the wait time for split queries). I will clarify with the reporter of the issue once again to wrap this up and ensure I did not miss anything.

matej-g · 2021-10-01T12:38:04Z

So as I said in my findings above I could not confirm the issue with 20% latency increase on query frontend, this is for me the conclusion of the issue.

As for the rest of the task, I opened a PR to add some benchmarks to query frontend, so feel free to have a look - #4723

…4623) * query-frontend: make HTTP downstream tripper configurable Make the HTTP downstream tripper configurable so that it would be possible to set the number of maximum idle connections per host to more than 2 so that HTTP keep-alive connections could be used on higher loads. Solves thanos-io#4571. Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com> * CHANGELOG: add entry Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>

GiedriusS · 2021-11-12T15:10:14Z

@matej-g let's close this, then? (:

matej-g · 2021-11-12T15:11:24Z

Ah, right 🙂 Please close it

GiedriusS mentioned this issue Aug 18, 2021

query-frontend: poor keep-alive connection reusage cortexproject/cortex#4428

Closed

GiedriusS added component: query-frontend difficulty: easy feature request/improvement labels Aug 18, 2021

GiedriusS mentioned this issue Sep 1, 2021

query-frontend: make HTTP downstream tripper configurable #4623

Merged

matej-g mentioned this issue Oct 1, 2021

Query frontend: Add benchmarks for labels codec and query range codec #4723

Merged

2 tasks

GiedriusS closed this as completed Nov 12, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

queryfrontend 20% latency overhead #4571

queryfrontend 20% latency overhead #4571

bwplotka commented Aug 17, 2021

GiedriusS commented Aug 17, 2021 •

edited

Loading

GiedriusS commented Aug 18, 2021

matej-g commented Aug 18, 2021

GiedriusS commented Sep 6, 2021

matej-g commented Sep 13, 2021 •

edited

Loading

GiedriusS commented Sep 15, 2021

matej-g commented Sep 16, 2021 •

edited

Loading

matej-g commented Oct 1, 2021

GiedriusS commented Nov 12, 2021

matej-g commented Nov 12, 2021

queryfrontend 20% latency overhead #4571

queryfrontend 20% latency overhead #4571

Comments

bwplotka commented Aug 17, 2021

GiedriusS commented Aug 17, 2021 • edited Loading

GiedriusS commented Aug 18, 2021

matej-g commented Aug 18, 2021

GiedriusS commented Sep 6, 2021

matej-g commented Sep 13, 2021 • edited Loading

GiedriusS commented Sep 15, 2021

matej-g commented Sep 16, 2021 • edited Loading

matej-g commented Oct 1, 2021

GiedriusS commented Nov 12, 2021

matej-g commented Nov 12, 2021

GiedriusS commented Aug 17, 2021 •

edited

Loading

matej-g commented Sep 13, 2021 •

edited

Loading

matej-g commented Sep 16, 2021 •

edited

Loading