-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Workload Management] QueryGroup Resource Cancellation #15151
[Workload Management] QueryGroup Resource Cancellation #15151
Conversation
❌ Gradle check result for 1a27813: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for 3c77d47: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for 3d11693: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for 8dd0cfa: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for cbb51bd: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❕ Gradle check result for 4e846e2: UNSTABLE
Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
@kiranprakash154 Can we fix precommit
and assemble
errors ?
.getResourceUsageData(); | ||
|
||
for (ResourceType resourceType : TRACKED_RESOURCES) { | ||
if (queryGroup.getResourceLimits().containsKey(resourceType) && queryGroupResourceUsage.containsKey(resourceType)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think right side of this && operator is not required
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its just a check to proceed with isBreachingThreshold
only if the querygroup has limits set on resourceType and there is usage recorded for the usage type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is true but it is unnecessary because the queryGroupResourceUsage
map contains resource usage for all tracked resources
server/src/main/java/org/opensearch/wlm/cancellation/DefaultTaskSelectionStrategy.java
Outdated
Show resolved
Hide resolved
Signed-off-by: Kiran Prakash <awskiran@amazon.com>
Signed-off-by: Kiran Prakash <awskiran@amazon.com>
Signed-off-by: Kiran Prakash <awskiran@amazon.com>
Signed-off-by: Kiran Prakash <awskiran@amazon.com>
Signed-off-by: Kiran Prakash <awskiran@amazon.com>
Signed-off-by: Kiran Prakash <awskiran@amazon.com>
Signed-off-by: Kiran Prakash <awskiran@amazon.com>
Signed-off-by: Kiran Prakash <awskiran@amazon.com>
Signed-off-by: Kiran Prakash <awskiran@amazon.com>
4e846e2
to
80dd918
Compare
❌ Gradle check result for 80dd918: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Signed-off-by: Kiran Prakash <awskiran@amazon.com>
❌ Gradle check result for a469e33: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for 3a5c03a: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
// Get the total CPU time of the process in milliseconds | ||
long cpuTotalTimeInMillis = ProcessProbe.getInstance().getProcessCpuTotalTime(); | ||
double nodeLevelCancellationThreshold = this.workloadManagementSettings.getNodeLevelCpuCancellationThreshold() | ||
* cpuTotalTimeInMillis; | ||
// Check if resource usage is breaching the threshold | ||
threshold = (long) (resourceThresholdInPercentage * nodeLevelCancellationThreshold); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not correct way, since we are taking process's total CPU time which could be well lets say in hours/days/months magnitude while on the other the hand, tasks may only be running since last couple milliseconds or even minutes in worst case before the request times out.
This will not refelect actual nano/milli seconds worth of a percentage. We are also not considering the number of processors available in this calculation as well
Closing this PR in favor of : #15651 |
Description
Includes the Resource Cancellation framework changes for Workload management, this is an extension to #13897
Test coverage
Related Issues
Resolves - #14883
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.