Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add context and logic to querier scheduler_processor to wait for query completion before stopping #1742

Closed
wants to merge 1 commit into from

Conversation

treid314
Copy link
Contributor

What this PR does:

This PR will wait for a querier to process any received and inflight queries before stopping. This is to prevent queries from using additional resources when a query is reschelded to a different querier when a querier that was executing an inflight query and is stopping due to a Kubernetes action like auto-scaling or a deploy. This could impact the time taken for a querier to stop if a long query is currently executing.

Which issue(s) this PR fixes or relates to

Fixes #

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@treid314 treid314 requested review from pstibrany and pracucci April 21, 2022 22:23
@CLAassistant
Copy link

CLAassistant commented Apr 21, 2022

CLA assistant check
All committers have signed the CLA.

@treid314
Copy link
Contributor Author

Remaining Work

  • Update CHANGELOG
  • Make sure to think about stuck queries and timeouts
  • Add unit or integration tests

@pstibrany
Copy link
Member

pstibrany commented Apr 22, 2022

This PR will wait for a querier to process any received and inflight queries before stopping.

This PR does two changes:

  1. it stops propagation of cancellation signal in context passed to processQueriesOnSingleStream, and let the currently handled query finish.
  2. it also stops fast-propagation of stream error.

Problem with 1 is that if querier is not receiving any queries from scheduler, this may block forever (or until stopped externally).

Re 2: Query-scheduler will actually close the stream when it is asked to cancel the in-flight request. (See

// If the upstream request is cancelled (eg. frontend issued CANCEL or closed connection),
) After this PR, querier will now no longer react on this, and will keep running the request. I'm not sure if ignoring cancellation request (propagated via closed stream) is intended.

This is to prevent queries from using additional resources when a query is reschelded to a different querier when a querier that was executing an inflight query and is stopping due to a Kubernetes action like auto-scaling or a deploy.

I don't quite understand the reasoning here. Can you please provide an example of what's happening?

@treid314
Copy link
Contributor Author

The general idea here is if we are using an HPA/Operator to auto-scale queriers, when we would scale down queriers there could be a query still running on a querier that is getting stopped by the HPA/Operator that was close completion. If we stop a querier with an inflight query, the query will go to a different querier and use that new queriers resources when if we would have waited for the query to complete on the original querier we would not have needed the resources on the second querier and the original querier.

However, point 1 and 2 are very correct. With this logic change we remove the ability to cancel queries that are currently running, for example: clicking cancel on the grafana frontend. And, we would never be able to stop a querier that had never received a query to enter the querier loop function.

Therefor, I'm closing this draft and we'll go back to the drawing board on how we want to solve this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants