Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Workload Management] QueryGroup Stats API Logic #15777

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

ruai0511
Copy link
Contributor

@ruai0511 ruai0511 commented Sep 5, 2024

Description

This change introduces the QueryGroup Stats API as part of this RFC: #12342.

The QueryGroup Stats API schema is:

curl -XGET "localhost:9200/_wlm/stats"

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Copy link
Contributor

github-actions bot commented Sep 5, 2024

❌ Gradle check result for f647487: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

github-actions bot commented Sep 5, 2024

❌ Gradle check result for 88c9cff: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

github-actions bot commented Sep 5, 2024

❌ Gradle check result for b66823c: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@ruai0511 ruai0511 added the backport 2.x Backport to 2.x branch label Sep 5, 2024
Copy link
Contributor

github-actions bot commented Sep 6, 2024

❌ Gradle check result for 02cf095: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

}

@Override
protected List<QueryGroupStats> readNodesFrom(StreamInput in) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this send the node_id along with the QueryGroupStats

Copy link
Contributor Author

@ruai0511 ruai0511 Sep 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This returns List<QueryGroupStats> but we can use queryGroupStats.getNode().getId() to get node_id

Comment on lines 87 to 94
* Inner QueryGroupStatsRequest
*
* @opensearch.experimental
*/
public static class NodeQueryGroupStatsRequest extends TransportRequest {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need this as this wrapper is not doing anything additional and the underlying object type is already TransportRequest ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm following the structure of TransportNodesStatsAction. NodeQueryGroupStatsRequest is used as a per-node action handler inside Transport Action, and QueryGroupStatsRequest represents the action for all nodes as a whole. The NodeQueryGroupStatsRequest serves as the container for the request sent to each node, allowing the framework to serialize and execute that request on the individual nodes.

Comment on lines +326 to +332
* QueryGroup stats of the cluster.
*
* @param request The QueryGroupStatsRequest
* @param listener A listener to be notified with a result
*/
void queryGroupStats(QueryGroupStatsRequest request, ActionListener<QueryGroupStatsResponse> listener);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why should we add this method in the ClusterAdmin interface ? I think this API should be accessible to all so does it make sense to add to AbstractClient since NodeClient extends it.

Copy link
Contributor Author

@ruai0511 ruai0511 Sep 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the actual implementation is in AbstractClient . I see nodeStats is also following this pattern (abstract function in ClusterAdminClient.java and implementation in AbstractClient.java)

*/
public class QueryGroupService {
// This map does not need to be concurrent since we will process the cluster state change serially and update
// this map with new additions and deletions of entries. QueryGroupState is thread safe
private final Map<String, QueryGroupState> queryGroupStateMap;
private final TransportService transportService;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should rely on just a Supplier<String> as we only need nodeId here

Copy link
Contributor

❌ Gradle check result for caec681: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

❌ Gradle check result for 6144198: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@ruai0511 ruai0511 force-pushed the querygroupstats-PR branch 2 times, most recently from a8bb3a5 to 5240274 Compare September 10, 2024 20:02
Copy link
Contributor

❕ Gradle check result for a8bb3a5: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Copy link
Contributor

❕ Gradle check result for 5240274: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Copy link

codecov bot commented Sep 10, 2024

Codecov Report

Attention: Patch coverage is 76.31579% with 27 lines in your changes missing coverage. Please review.

Project coverage is 71.82%. Comparing base (f8515c7) to head (41aadb5).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
...ction/admin/cluster/RestQueryGroupStatsAction.java 36.36% 7 Missing ⚠️
...tion/admin/cluster/wlm/QueryGroupStatsRequest.java 66.66% 6 Missing ⚠️
...in/cluster/wlm/TransportQueryGroupStatsAction.java 77.77% 4 Missing ⚠️
...ain/java/org/opensearch/wlm/QueryGroupService.java 78.94% 0 Missing and 4 partials ⚠️
...ion/admin/cluster/wlm/QueryGroupStatsResponse.java 90.00% 2 Missing ⚠️
.../org/opensearch/client/support/AbstractClient.java 0.00% 2 Missing ⚠️
...ch/wlm/WorkloadManagementTransportInterceptor.java 80.00% 0 Missing and 1 partial ⚠️
...java/org/opensearch/wlm/stats/QueryGroupStats.java 75.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #15777      +/-   ##
============================================
- Coverage     71.89%   71.82%   -0.07%     
- Complexity    64211    64227      +16     
============================================
  Files          5272     5277       +5     
  Lines        300578   300670      +92     
  Branches      43432    43436       +4     
============================================
- Hits         216086   215959     -127     
- Misses        66707    66945     +238     
+ Partials      17785    17766      -19     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@sgup432
Copy link
Contributor

sgup432 commented Sep 10, 2024

@ruai0511 @kaushalmahi12
I guess the node stats URI should _wlm/query_group/stats? As it looks more appropriate and also aligns with what we discussed in the stats meta issue. And allows for further extension in case we want.

@ruai0511
Copy link
Contributor Author

ruai0511 commented Sep 10, 2024

@ruai0511 @kaushalmahi12 I guess the node stats URI should _wlm/query_group/stats? As it looks more appropriate and also aligns with what we discussed in the stats meta issue. And allows for further extension in case we want.

If we use GET _wlm/query_group/stats, it'll create confusion between getting the stats and getting a querygroup named stats as the path would be the same. So if we want to keep _wlm, we probably need to change it to _wlm/query_group_stats

Copy link
Contributor

❌ Gradle check result for 9bbc037: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@sgup432
Copy link
Contributor

sgup432 commented Sep 10, 2024

If we use GET _wlm/query_group/stats, it'll create confusion between getting the stats and getting a querygroup named stats as the path would be the same. So if we want to keep _wlm, we probably need to change it to _wlm/query_group_stats

I think it is upto user what kind of naming they want to use. In that case they can also use query_group_stats as a name which is weird and is also confusing in itself with current URI structure.

I think having _wlm/<metric>/<name>/stats allows for better URI structure and is more consistent with the existing APIs.
If someone wants to use stats as a name so be it, they can still fetch it via _wlm/query_group/stats/stats which doesn't break the API from logical point of view.

Copy link
Contributor

✅ Gradle check result for 41aadb5: SUCCESS

Signed-off-by: Ruirui Zhang <mariazrr@amazon.com>
Signed-off-by: Ruirui Zhang <mariazrr@amazon.com>
Signed-off-by: Ruirui Zhang <mariazrr@amazon.com>
Signed-off-by: Ruirui Zhang <mariazrr@amazon.com>
Copy link
Contributor

❌ Gradle check result for 9714571: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Ruirui Zhang <mariazrr@amazon.com>

public QueryGroupService() {
this(new HashMap<>());
@Inject
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is @Inject needed here since we are creating the instance in Node.java

Copy link
Contributor

❌ Gradle check result for 2df37a3: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Backport to 2.x branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants