Add context and logic to querier scheduler_processor to wait for query completion before stopping #1742

treid314 · 2022-04-21T22:23:30Z

What this PR does:

This PR will wait for a querier to process any received and inflight queries before stopping. This is to prevent queries from using additional resources when a query is reschelded to a different querier when a querier that was executing an inflight query and is stopping due to a Kubernetes action like auto-scaling or a deploy. This could impact the time taken for a querier to stop if a long query is currently executing.

Which issue(s) this PR fixes or relates to

Fixes #

Checklist

Tests updated
Documentation added
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

…fore stopping

CLAassistant · 2022-04-21T22:23:40Z

All committers have signed the CLA.

treid314 · 2022-04-21T22:26:27Z

Remaining Work

Update CHANGELOG
Make sure to think about stuck queries and timeouts
Add unit or integration tests

pstibrany · 2022-04-22T12:33:49Z

This PR will wait for a querier to process any received and inflight queries before stopping.

This PR does two changes:

it stops propagation of cancellation signal in context passed to processQueriesOnSingleStream, and let the currently handled query finish.
it also stops fast-propagation of stream error.

Problem with 1 is that if querier is not receiving any queries from scheduler, this may block forever (or until stopped externally).

Re 2: Query-scheduler will actually close the stream when it is asked to cancel the in-flight request. (See

mimir/pkg/scheduler/scheduler.go

Line 436 in 8f8f2c0

    
           // If the upstream request is cancelled (eg. frontend issued CANCEL or closed connection),

) After this PR, querier will now no longer react on this, and will keep running the request. I'm not sure if ignoring cancellation request (propagated via closed stream) is intended.

This is to prevent queries from using additional resources when a query is reschelded to a different querier when a querier that was executing an inflight query and is stopping due to a Kubernetes action like auto-scaling or a deploy.

I don't quite understand the reasoning here. Can you please provide an example of what's happening?

treid314 · 2022-04-22T20:06:18Z

The general idea here is if we are using an HPA/Operator to auto-scale queriers, when we would scale down queriers there could be a query still running on a querier that is getting stopped by the HPA/Operator that was close completion. If we stop a querier with an inflight query, the query will go to a different querier and use that new queriers resources when if we would have waited for the query to complete on the original querier we would not have needed the resources on the second querier and the original querier.

However, point 1 and 2 are very correct. With this logic change we remove the ability to cancel queries that are currently running, for example: clicking cancel on the grafana frontend. And, we would never be able to stop a querier that had never received a query to enter the querier loop function.

Therefor, I'm closing this draft and we'll go back to the drawing board on how we want to solve this issue.

Add context to querier scheduler to wait for open query completion be…

d63e759

…fore stopping

treid314 requested review from pstibrany and pracucci April 21, 2022 22:23

treid314 requested a review from aldernero April 21, 2022 22:23

treid314 added the component/querier label Apr 21, 2022

treid314 closed this Apr 22, 2022

pracucci mentioned this pull request Apr 26, 2022

Gracefully shutdown querier when using query-scheduler #1756

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add context and logic to querier scheduler_processor to wait for query completion before stopping #1742

Add context and logic to querier scheduler_processor to wait for query completion before stopping #1742

treid314 commented Apr 21, 2022

CLAassistant commented Apr 21, 2022 •

edited

Loading

treid314 commented Apr 21, 2022

pstibrany commented Apr 22, 2022 •

edited

Loading

treid314 commented Apr 22, 2022

Add context and logic to querier scheduler_processor to wait for query completion before stopping #1742

Add context and logic to querier scheduler_processor to wait for query completion before stopping #1742

Conversation

treid314 commented Apr 21, 2022

What this PR does:

Which issue(s) this PR fixes or relates to

Checklist

CLAassistant commented Apr 21, 2022 • edited Loading

treid314 commented Apr 21, 2022

pstibrany commented Apr 22, 2022 • edited Loading

treid314 commented Apr 22, 2022

CLAassistant commented Apr 21, 2022 •

edited

Loading

pstibrany commented Apr 22, 2022 •

edited

Loading