Skip to content

Release operation permit on thread-pool rejection #25930

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Jul 27, 2017

Conversation

ywelsch
Copy link
Contributor

@ywelsch ywelsch commented Jul 27, 2017

At the shard level we use an operation permit to coordinate between regular shard operations and special operations that need exclusive access. In ES versions < 6, the operation requiring exclusive access was invoked during primary relocation, but ES versions >= 6 this exclusive access is also used when a replica learns about a new primary or when a replica is promoted to primary.

These special operations requiring exclusive access delay regular operations from running, by adding them to a queue, and after finishing the exclusive access, release these operations which then need to be put back on the original thread-pool they were running on. In the presence of thread pool rejections, the current implementation had two issues:

  • it would not properly release the operation permit when hitting a rejection (i.e. when calling ThreadedActionListener.onResponse from IndexShardOperationPermits.acquire).
  • it would not invoke the onFailure method of the action listener when the shard was closed, and just log a warning instead (see ThreadedActionListener.onFailure), which would ultimately lead to the replication task never being cleaned up (see Tasks stuck forever #25863).

This PR fixes both issues by introducing a custom threaded action listener that is permit-aware and properly deals with rejections.

Closes #25863

@ywelsch ywelsch added :Core/Infra/Core Core issues without another label >bug v5.5.2 v6.0.0 v7.0.0 labels Jul 27, 2017
@ywelsch ywelsch requested a review from bleskes July 27, 2017 09:46
Copy link
Contributor

@bleskes bleskes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code LGTM in sense that it does the right thing while maintaining the old behavior of the generic ThreadedActionListener. However, in this case, I'm a bit concerned about the number of options in which the failure handler can be called - sometimes on the requested executor, sometimes on the generic thread and some times on the original thread - so there is really no guarantee on what's "safe" to do there. I wonder if we should make it simpler code wise and say that the executor on holds for onResponse. The failure handler can be called on arbitrary threads and should only be used for light weight work (which we do). WDYT?


@Override
public void onFailure(Exception e) {
listener.onFailure(e); // will possibly execute on the generic thread spawned off in releaseDelayedOperations
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we release the releasable just in case the exception comes from the listener? this would allow us to only implement on failure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is the responsibility of the wrapped listener to make sure to free the permit when encountering an exception in its onResponse method. I prefer the code to be more explicit here instead of mixing two failures paths.


@Override
public void onFailure(Exception e) {
listener.onFailure(e); // will possibly execute on the generic thread spawned off in releaseDelayedOperations
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case of rejection, I think we want to add the original exception as "suppressed"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment becomes obsolete by running the PermitAwareThreadedActionListener.onFailure() method on the caller thread.

@@ -59,7 +63,18 @@

@BeforeClass
public static void setupThreadPool() {
threadPool = new TestThreadPool("IndexShardOperationsLockTests");
int bulkThreadPoolSize = randomIntBetween(1, 2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++ to randomly use limited capacity.

@ywelsch ywelsch requested a review from bleskes July 27, 2017 10:26
Copy link
Contributor

@bleskes bleskes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx @ywelsch . My only nit is to add java docs for the required "lightness" of the onFailure handling in the onAcquired listener.

@ywelsch ywelsch merged commit 620536f into elastic:master Jul 27, 2017
@ywelsch
Copy link
Contributor Author

ywelsch commented Jul 27, 2017

Thanks @bleskes

ywelsch added a commit that referenced this pull request Jul 27, 2017
At the shard level we use an operation permit to coordinate between regular shard operations and special operations that need exclusive access. In ES versions < 6, the operation requiring exclusive access was invoked during primary relocation, but ES versions >= 6 this exclusive access is also used when a replica learns about a new primary or when a replica is promoted to primary.

These special operations requiring exclusive access delay regular operations from running, by adding them to a queue, and after finishing the exclusive access, release these operations which then need to be put back on the original thread-pool they were running on. In the presence of thread pool rejections, the current implementation had two issues:

- it would not properly release the operation permit when hitting a rejection (i.e. when calling ThreadedActionListener.onResponse from IndexShardOperationPermits.acquire).
- it would not invoke the onFailure method of the action listener when the shard was closed, and just log a warning instead (see ThreadedActionListener.onFailure), which would ultimately lead to the replication task never being cleaned up (see #25863).

This commit fixes both issues by introducing a custom threaded action listener that is permit-aware and properly deals with rejections.

Closes #25863
@ywelsch ywelsch added the v6.1.0 label Jul 27, 2017
ywelsch added a commit that referenced this pull request Jul 27, 2017
At the shard level we use an operation permit to coordinate between regular shard operations and special operations that need exclusive access. In ES versions < 6, the operation requiring exclusive access was invoked during primary relocation, but ES versions >= 6 this exclusive access is also used when a replica learns about a new primary or when a replica is promoted to primary.

These special operations requiring exclusive access delay regular operations from running, by adding them to a queue, and after finishing the exclusive access, release these operations which then need to be put back on the original thread-pool they were running on. In the presence of thread pool rejections, the current implementation had two issues:

- it would not properly release the operation permit when hitting a rejection (i.e. when calling ThreadedActionListener.onResponse from IndexShardOperationPermits.acquire).
- it would not invoke the onFailure method of the action listener when the shard was closed, and just log a warning instead (see ThreadedActionListener.onFailure), which would ultimately lead to the replication task never being cleaned up (see #25863).

This commit fixes both issues by introducing a custom threaded action listener that is permit-aware and properly deals with rejections.

Closes #25863
ywelsch added a commit that referenced this pull request Jul 27, 2017
At the shard level we use an operation permit to coordinate between regular shard operations and special operations that need exclusive access. In ES versions < 6, the operation requiring exclusive access was invoked during primary relocation, but ES versions >= 6 this exclusive access is also used when a replica learns about a new primary or when a replica is promoted to primary.

These special operations requiring exclusive access delay regular operations from running, by adding them to a queue, and after finishing the exclusive access, release these operations which then need to be put back on the original thread-pool they were running on. In the presence of thread pool rejections, the current implementation had two issues:

- it would not properly release the operation permit when hitting a rejection (i.e. when calling ThreadedActionListener.onResponse from IndexShardOperationPermits.acquire).
- it would not invoke the onFailure method of the action listener when the shard was closed, and just log a warning instead (see ThreadedActionListener.onFailure), which would ultimately lead to the replication task never being cleaned up (see #25863).

This commit fixes both issues by introducing a custom threaded action listener that is permit-aware and properly deals with rejections.

Closes #25863
ywelsch added a commit that referenced this pull request Jul 27, 2017
At the shard level we use an operation permit to coordinate between regular shard operations and special operations that need exclusive access. In ES versions < 6, the operation requiring exclusive access was invoked during primary relocation, but ES versions >= 6 this exclusive access is also used when a replica learns about a new primary or when a replica is promoted to primary.

These special operations requiring exclusive access delay regular operations from running, by adding them to a queue, and after finishing the exclusive access, release these operations which then need to be put back on the original thread-pool they were running on. In the presence of thread pool rejections, the current implementation had two issues:

- it would not properly release the operation permit when hitting a rejection (i.e. when calling ThreadedActionListener.onResponse from IndexShardOperationPermits.acquire).
- it would not invoke the onFailure method of the action listener when the shard was closed, and just log a warning instead (see ThreadedActionListener.onFailure), which would ultimately lead to the replication task never being cleaned up (see #25863).

This commit fixes both issues by introducing a custom threaded action listener that is permit-aware and properly deals with rejections.

Closes #25863
@lcawl lcawl removed the v6.1.0 label Dec 12, 2017
@jimczi jimczi added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Tasks stuck forever
5 participants