Add support to dynamically resize threadpools size #16236

Merged
merged 5 commits into from
Oct 15, 2024

Conversation

gbbafna
Collaborator

@gbbafna gbbafna commented Oct 8, 2024

Description

Adds capability to change thread pool sizes for all threadpools defined in core
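The PR's actual changes are in ThreadPool.java and are not shown in this conversation. As a rough illustration of what resizing a live pool involves, here is a minimal sketch built only on the JDK's `java.util.concurrent.ThreadPoolExecutor`. The `PoolResizer` class and its `resize` method are hypothetical names, not code from this PR. The one subtlety it shows is ordering: on recent JDKs, `setCorePoolSize` rejects a core size above the current maximum and `setMaximumPoolSize` rejects a maximum below the current core, so the two setters have to be called in an order that keeps core <= max at every step.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration; not code from this PR.
final class PoolResizer {

    // Resizes a live executor while keeping core <= max at every step.
    static void resize(ThreadPoolExecutor executor, int core, int max) {
        if (core < 0 || max <= 0 || core > max) {
            throw new IllegalArgumentException("need 0 <= core <= max and max > 0");
        }
        if (max >= executor.getMaximumPoolSize()) {
            // Ceiling grows or stays: raise max first so the new core fits under it.
            executor.setMaximumPoolSize(max);
            executor.setCorePoolSize(core);
        } else {
            // Ceiling shrinks: lower core first so it never exceeds the new max.
            executor.setCorePoolSize(core);
            executor.setMaximumPoolSize(max);
        }
    }

    public static void main(String[] args) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            1, 5, 30, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        resize(pool, 4, 10); // scale up
        System.out.println(pool.getCorePoolSize() + "/" + pool.getMaximumPoolSize());
        resize(pool, 1, 2);  // scale back down
        System.out.println(pool.getCorePoolSize() + "/" + pool.getMaximumPoolSize());
        pool.shutdown();
    }
}
```

Shrinking does not interrupt running tasks; excess threads are only retired once they become idle, which is part of why a dynamic change is easy to revert.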

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@gbbafna gbbafna added the backport 2.x Backport to 2.x branch label Oct 8, 2024
@gbbafna gbbafna changed the title Dynamic threadpool Add support to dynamically resize threadpools size Oct 8, 2024
Contributor

github-actions bot commented Oct 8, 2024

❌ Gradle check result for 71df521: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

github-actions bot commented Oct 8, 2024

❌ Gradle check result for 8d75efd: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Member

@ashking94 ashking94 left a comment

Overall, I love the idea to tune the threadpool size. This will give end users the option to get the best out of their opensearch cluster.

@jed326
Collaborator

jed326 commented Oct 10, 2024

Thanks @gbbafna! I like the idea of dynamically tuning the threadpool size but I am worried there are some sharp edges for users with a cluster setting.

Today the threadpool size can be set per-node via the opensearch.yml, so in cases where there is a mix of hardware types across the cluster, each node can have its own configuration. It is pretty typical for the cluster manager nodes and the data nodes to be of different instance types, so the concern is that we could accidentally adversely affect one instance type by setting the cluster-wide setting.

For example, if we increase the snapshots threadpool size across the cluster, that might improve the performance of taking a snapshot on the data nodes, but have adverse effects on cloning a snapshot on the cluster manager nodes.

@dbwiddis
Member

Overall, I love the idea to tune the threadpool size. This will give end users the option to get the best out of their opensearch cluster.

I love the idea of tuning, but that also just adds a lot more complexity that barely any of us understand! What should it be?

Giving one thread pool more resources should ideally be balanced by a reduction in another, but how do we know? Is there any way to have an overall "max threads across all these different pools" constraint to let people tweak up one pool as long as they constrain others? Are there stats that let us measure how many times a pool is maximized?

@gbbafna
Collaborator Author

gbbafna commented Oct 11, 2024

Thanks @gbbafna! I like the idea of dynamically tuning the threadpool size but I am worried there are some sharp edges for users with a cluster setting.

Today the threadpool size can be set per-node via the opensearch.yml, so in cases where there is a mix of hardware types across the cluster, each node can have its own configuration. It is pretty typical for the cluster manager nodes and the data nodes to be of different instance types, so the concern is that we could accidentally adversely affect one instance type by setting the cluster-wide setting.

For example, if we increase the snapshots threadpool size across the cluster, that might improve the performance of taking a snapshot on the data nodes, but have adverse effects on cloning a snapshot on the cluster manager nodes.

These are quite valid points, @jed326; thanks for raising them. There are definitely sharp edges with this feature. We would recommend that only expert users tune this. Also, since it's dynamic, users can easily revert it if they observe degradation. At a higher level, I would say we should have separate thread pools for cloning and creating snapshots. Recently we introduced another threadpool for deletion as well, so that it can be scaled up and down independently.

Also, I would note in the documentation not to use this if you have a heterogeneous cluster. In practice, most clusters I have seen are homogeneous.

@gbbafna
Collaborator Author

gbbafna commented Oct 11, 2024

Overall, I love the idea to tune the threadpool size. This will give end users the option to get the best out of their opensearch cluster.

I love the idea of tuning, but that also just adds a lot more complexity that barely any of us understand! What should it be?

Giving one thread pool more resources should ideally be balanced by a reduction in another, but how do we know?

Thanks @dbwiddis. I agree it is tricky to tune, and I would document that only experts should tune it on an as-needed basis and monitor for any degradation.

how do we know? - There would be some experimentation required to figure out what is best for a given configuration.

Is there any way to have an overall "max threads across all these different pools" constraint to let people tweak up one pool as long as they constrain others?

We can think about this if it would benefit our users.
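To make the "max threads across all pools" idea concrete, here is a minimal sketch of how such a constraint could be checked before a resize is accepted. Nothing like this is confirmed to exist in the PR; `ThreadBudget`, `tryResize`, and the pool names and sizes used below are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical global budget for illustration; not part of this PR.
final class ThreadBudget {
    private final int maxTotalThreads;
    private final Map<String, Integer> sizes = new LinkedHashMap<>();

    ThreadBudget(int maxTotalThreads, Map<String, Integer> initialSizes) {
        this.maxTotalThreads = maxTotalThreads;
        this.sizes.putAll(initialSizes);
        if (total() > maxTotalThreads) {
            throw new IllegalArgumentException("initial sizes exceed the budget");
        }
    }

    // Sum of the currently configured sizes across all pools.
    synchronized int total() {
        int sum = 0;
        for (int size : sizes.values()) {
            sum += size;
        }
        return sum;
    }

    // Accepts the new size only if the overall total stays within the budget.
    synchronized boolean tryResize(String pool, int newSize) {
        Integer current = sizes.get(pool);
        if (current == null || newSize < 1) {
            return false; // unknown pool or invalid size
        }
        if (total() - current + newSize > maxTotalThreads) {
            return false; // would exceed the budget; shrink another pool first
        }
        sizes.put(pool, newSize);
        return true;
    }

    public static void main(String[] args) {
        Map<String, Integer> initial = new LinkedHashMap<>();
        initial.put("snapshot", 4);
        initial.put("write", 8);
        initial.put("search", 4);
        ThreadBudget budget = new ThreadBudget(16, initial);
        System.out.println(budget.tryResize("snapshot", 8)); // budget already full
        System.out.println(budget.tryResize("write", 4));    // frees four threads
        System.out.println(budget.tryResize("snapshot", 8)); // now fits
    }
}
```

A check like this only bounds the thread count. It says nothing about whether the chosen split is a good one, which still takes the experimentation mentioned above.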

Are there stats that let us measure how many times a pool is maximized?

Yes, we have thread pool stats to figure that out.
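As an illustration of the kind of counters that reveal a maxed-out pool, the sketch below saturates a plain JDK `ThreadPoolExecutor` and reads its largest pool size, its queue depth, and a rejection count. The `SaturationProbe` class is hypothetical and is not how OpenSearch collects its thread pool stats; it only shows which signals indicate that a pool has hit its limit.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical probe for illustration; not an OpenSearch class.
final class SaturationProbe {

    // Returns {largest pool size, queued tasks, rejected tasks} observed while
    // `tasks` blocking jobs are submitted to a fixed pool of `threads` threads.
    static long[] run(int threads, int queueCapacity, int tasks) {
        AtomicLong rejected = new AtomicLong();
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
            threads, threads, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(queueCapacity),
            (task, pool) -> rejected.incrementAndGet()); // count instead of throwing
        CountDownLatch release = new CountDownLatch(1);
        for (int i = 0; i < tasks; i++) {
            executor.execute(() -> {
                try {
                    release.await(); // hold the thread so the pool stays saturated
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        long[] stats = {
            executor.getLargestPoolSize(), executor.getQueue().size(), rejected.get()
        };
        release.countDown(); // let the blocked tasks finish
        executor.shutdown();
        return stats;
    }

    public static void main(String[] args) {
        long[] stats = run(2, 1, 5);
        System.out.println(
            "largest=" + stats[0] + " queued=" + stats[1] + " rejected=" + stats[2]);
    }
}
```

With two threads and a queue of one, five blocking tasks leave two running, one queued, and two rejected. A rejection count that keeps climbing is the clearest sign that a pool is undersized for its load.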

Contributor

❌ Gradle check result for 4bedb9c: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

❌ Gradle check result for b161100: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Gaurav Bafna <gbbafna@amazon.com>
Signed-off-by: Gaurav Bafna <gbbafna@amazon.com>
Signed-off-by: Gaurav Bafna <gbbafna@amazon.com>
… listener on it

Signed-off-by: Gaurav Bafna <gbbafna@amazon.com>
Contributor

❌ Gradle check result for cc11f7e: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Member

@ashking94 ashking94 left a comment

LGTM overall, have left some comments.

Signed-off-by: Gaurav Bafna <gbbafna@amazon.com>
Contributor

❕ Gradle check result for 2df2683: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

codecov bot commented Oct 15, 2024

Codecov Report

Attention: Patch coverage is 53.44828% with 27 lines in your changes missing coverage. Please review.

Project coverage is 72.00%. Comparing base (691f725) to head (2df2683).
Report is 22 commits behind head on main.

Files with missing lines Patch % Lines
...ain/java/org/opensearch/threadpool/ThreadPool.java 52.63% 25 Missing and 2 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #16236      +/-   ##
============================================
- Coverage     72.06%   72.00%   -0.06%     
+ Complexity    64822    64786      -36     
============================================
  Files          5308     5308              
  Lines        302574   302613      +39     
  Branches      43710    43723      +13     
============================================
- Hits         218048   217897     -151     
- Misses        66648    66797     +149     
- Partials      17878    17919      +41     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@gbbafna gbbafna merged commit 35c366d into opensearch-project:main Oct 15, 2024
37 of 38 checks passed
@opensearch-trigger-bot
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/OpenSearch/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/OpenSearch/backport-2.x
# Create a new branch
git switch --create backport/backport-16236-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 35c366ddc794e0600184cf406c06ae65061e28ce
# Push it to GitHub
git push --set-upstream origin backport/backport-16236-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/OpenSearch/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-16236-to-2.x.

gbbafna added a commit to gbbafna/OpenSearch that referenced this pull request Oct 15, 2024
ashking94 pushed a commit that referenced this pull request Oct 15, 2024
Signed-off-by: Gaurav Bafna <gbbafna@amazon.com>
dk2k pushed a commit to dk2k/OpenSearch that referenced this pull request Oct 16, 2024
dk2k pushed a commit to dk2k/OpenSearch that referenced this pull request Oct 17, 2024
dk2k pushed a commit to dk2k/OpenSearch that referenced this pull request Oct 21, 2024
Labels
backport 2.x Backport to 2.x branch backport-failed
5 participants