Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce ConcurrentQueryProfiler to profile query using concurrent segment search path and support concurrency during rewrite and create weight #10352

Merged
merged 2 commits into from
Oct 20, 2023

Conversation

ticheng-aws
Copy link
Contributor

@ticheng-aws ticheng-aws commented Oct 4, 2023

Description

The rewrite_time field contains the cumulative total rewrite time for a query. In the concurrent search case, the issue is “rewrite” happened on "before" (i.e. non concurrent) and "during" the concurrent path. There is only 1 rewrite timer shared across all slices, which lead the racing condition happened in the multiple threadings case.

To resolve the issue, we need to have a distinct rewrite timer for each query, then compute the sum of the non-overlapping time across all slices and add the rewrite time from the non-concurrent path.

A similar race condition happened on building the profile tree stack during the create weight step. To address this issue, I added threadToProfileTree map into the profiler. This way, each thread has its own profile tree, ensuring that the stack won't access tree nodes from other threads. In the end, we simply combine all the trees into the profileResults list.

Related Issues

Resolves #9787

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • GitHub issue/PR created in OpenSearch documentation repo for the required public documentation changes (#)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Copy link
Contributor

github-actions bot commented Oct 4, 2023

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

github-actions bot commented Oct 4, 2023

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

github-actions bot commented Oct 4, 2023

Compatibility status:

Checks if related components are compatible with change 47a5802

Incompatible components

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/reporting.git]

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@ticheng-aws ticheng-aws self-assigned this Oct 19, 2023
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@ticheng-aws
Copy link
Contributor Author

Gradle Check (Jenkins) Run Completed with:

Tests with failures:
 - org.opensearch.cluster.allocation.ClusterRerouteIT.testDelayWithALargeAmountOfShards
 - org.opensearch.remotestore.RemoteStoreClusterStateRestoreIT.testFullClusterRestoreGlobalMetadata
 - org.opensearch.search.SearchWeightedRoutingIT.testShardRoutingWithNetworkDisruption_FailOpenEnabled

Known issues #10558, #10749, #10673

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

Signed-off-by: Ticheng Lin <ticheng@amazon.com>
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT.test {yaml=pit/10_basic/Delete all}
      1 org.opensearch.action.admin.cluster.node.tasks.ResourceAwareTasksTests.testTaskResourceTrackingDuringTaskCancellation

@reta
Copy link
Collaborator

reta commented Oct 20, 2023

#5329
#9332

@reta reta merged commit 200ad5d into opensearch-project:main Oct 20, 2023
16 checks passed
@reta reta added backport 2.x Backport to 2.x branch v3.0.0 Issues and PRs related to version 3.0.0 labels Oct 20, 2023
@opensearch-trigger-bot
Copy link
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/OpenSearch/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/OpenSearch/backport-2.x
# Create a new branch
git switch --create backport/backport-10352-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 200ad5d28a577877be530ecab507601898025c5c
# Push it to GitHub
git push --set-upstream origin backport/backport-10352-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/OpenSearch/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-10352-to-2.x.

@reta
Copy link
Collaborator

reta commented Oct 20, 2023

@ticheng-aws sadly backport to 2.x failed, could you please send a manual pull request? thank you

@sohami
Copy link
Collaborator

sohami commented Oct 20, 2023

@ticheng-aws sadly backport to 2.x failed, could you please send a manual pull request? thank you

@ticheng-aws We will need to backport all the previous PRs for concurrent profiler as well. I think during 2.11 we didn't do it. But lets do it now. Then it will go thru for this one as well.

@reta
Copy link
Collaborator

reta commented Oct 20, 2023

@ticheng-aws We will need to backport all the previous PRs for concurrent profiler as well. I think during 2.11 we didn't do it. But lets do it now. Then it will go thru for this one as well.

Oh wait , is it because we run into NPEs?

austintlee pushed a commit to austintlee/OpenSearch that referenced this pull request Oct 23, 2023
…egment search path and support concurrency during rewrite and create weight (opensearch-project#10352)

* Fix timer race condition in profile rewrite and create weight for concurrent segment search (opensearch-project#10352)

Signed-off-by: Ticheng Lin <ticheng@amazon.com>

* Refactor and work on the PR comments (opensearch-project#10352)

Signed-off-by: Ticheng Lin <ticheng@amazon.com>

---------

Signed-off-by: Ticheng Lin <ticheng@amazon.com>
@ticheng-aws
Copy link
Contributor Author

@ticheng-aws We will need to backport all the previous PRs for concurrent profiler as well. I think during 2.11 we didn't do it. But lets do it now. Then it will go thru for this one as well.

Oh wait , is it because we run into NPEs?

You are right. We need to backport #9248, #10111, #10547, and the current one #10352.

sohami pushed a commit to sohami/OpenSearch that referenced this pull request Oct 24, 2023
…egment search path and support concurrency during rewrite and create weight (opensearch-project#10352)

* Fix timer race condition in profile rewrite and create weight for concurrent segment search (opensearch-project#10352)

Signed-off-by: Ticheng Lin <ticheng@amazon.com>

* Refactor and work on the PR comments (opensearch-project#10352)

Signed-off-by: Ticheng Lin <ticheng@amazon.com>

---------

Signed-off-by: Ticheng Lin <ticheng@amazon.com>
sohami pushed a commit to sohami/OpenSearch that referenced this pull request Oct 30, 2023
…egment search path and support concurrency during rewrite and create weight (opensearch-project#10352)

* Fix timer race condition in profile rewrite and create weight for concurrent segment search (opensearch-project#10352)

Signed-off-by: Ticheng Lin <ticheng@amazon.com>

* Refactor and work on the PR comments (opensearch-project#10352)

Signed-off-by: Ticheng Lin <ticheng@amazon.com>

---------

Signed-off-by: Ticheng Lin <ticheng@amazon.com>
sohami pushed a commit to sohami/OpenSearch that referenced this pull request Oct 31, 2023
…egment search path and support concurrency during rewrite and create weight (opensearch-project#10352)

* Fix timer race condition in profile rewrite and create weight for concurrent segment search (opensearch-project#10352)

Signed-off-by: Ticheng Lin <ticheng@amazon.com>

* Refactor and work on the PR comments (opensearch-project#10352)

Signed-off-by: Ticheng Lin <ticheng@amazon.com>

---------

Signed-off-by: Ticheng Lin <ticheng@amazon.com>
reta pushed a commit that referenced this pull request Oct 31, 2023
…#10898)

* Add support for query profiler with concurrent aggregation (#9248)

* Add support for query profiler with concurrent aggregation (#9248)

Signed-off-by: Ticheng Lin <ticheng@amazon.com>

* Refactor and work on the PR comments

Signed-off-by: Ticheng Lin <ticheng@amazon.com>

* Update collectorToLeaves mapping for children breakdowns post profile metric collection and before creating the results

Signed-off-by: Sorabh Hamirwasia <sohami.apache@gmail.com>

* Refactor logic to compute the slice level breakdown stats and query level breakdown stats for concurrent search case

Signed-off-by: Sorabh Hamirwasia <sohami.apache@gmail.com>

* Fix QueryProfilePhaseTests and QueryProfileTests, and parameterize QueryProfilerIT with concurrent search enabled

Signed-off-by: Ticheng Lin <ticheng@amazon.com>

* Handle the case when there are no leaf context to compute the profile stats to return default stats
for all breakdown type along with min/max/avg values. Replace queryStart and queryEnd time with queryNodeTime

Signed-off-by: Sorabh Hamirwasia <sohami.apache@gmail.com>

* Add UTs for ConcurrentQueryProfileBreakdown

Signed-off-by: Sorabh Hamirwasia <sohami.apache@gmail.com>

* Add concurrent search stats test into the QueryProfilerIT

Signed-off-by: Ticheng Lin <ticheng@amazon.com>

* Address review comments

Signed-off-by: Sorabh Hamirwasia <sohami.apache@gmail.com>

---------

Signed-off-by: Ticheng Lin <ticheng@amazon.com>
Signed-off-by: Sorabh Hamirwasia <sohami.apache@gmail.com>
Co-authored-by: Sorabh Hamirwasia <sohami.apache@gmail.com>

* Fix NPE in ConcurrentQueryProfile while computing the breakdown map for slices (#10111)

* Fix NPE in ConcurrentQueryProfile while computing the breakdown map for slices.

There can be cases where one or more slice may not have timing related information for its leaves in
contexts map. During creation of slice and query level breakdown map it needs to handle such cases by using
default values correctly. Also updating the min/max/avg sliceNodeTime to not include time to create
weight and wait times by slice threads. It will reflect the min/max/avg execution time of each slice
whereas totalNodeTime will reflect the total query time.

Signed-off-by: Sorabh Hamirwasia <sohami.apache@gmail.com>

* Address review comments

Signed-off-by: Sorabh Hamirwasia <sohami.apache@gmail.com>

---------

Signed-off-by: Sorabh Hamirwasia <sohami.apache@gmail.com>

* Fix flaky query profile phase tests with concurrent search enabled (#10547) (#10547)

Signed-off-by: Ticheng Lin <ticheng@amazon.com>

* Introduce ConcurrentQueryProfiler to profile query using concurrent segment search path and support concurrency during rewrite and create weight (#10352)

* Fix timer race condition in profile rewrite and create weight for concurrent segment search (#10352)

Signed-off-by: Ticheng Lin <ticheng@amazon.com>

* Refactor and work on the PR comments (#10352)

Signed-off-by: Ticheng Lin <ticheng@amazon.com>

---------

Signed-off-by: Ticheng Lin <ticheng@amazon.com>

---------

Signed-off-by: Ticheng Lin <ticheng@amazon.com>
Signed-off-by: Sorabh Hamirwasia <sohami.apache@gmail.com>
Co-authored-by: Ticheng Lin <51488860+ticheng-aws@users.noreply.github.com>
shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this pull request Apr 25, 2024
…egment search path and support concurrency during rewrite and create weight (opensearch-project#10352)

* Fix timer race condition in profile rewrite and create weight for concurrent segment search (opensearch-project#10352)

Signed-off-by: Ticheng Lin <ticheng@amazon.com>

* Refactor and work on the PR comments (opensearch-project#10352)

Signed-off-by: Ticheng Lin <ticheng@amazon.com>

---------

Signed-off-by: Ticheng Lin <ticheng@amazon.com>
Signed-off-by: Shivansh Arora <hishiv@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Backport to 2.x branch backport-failed bug Something isn't working flaky-test Random test failure that succeeds on second run v3.0.0 Issues and PRs related to version 3.0.0
Projects
None yet
4 participants