Add new Threadpool for Cluster Coordination Activities #83576
Conversation
Using the generic pool for all the activities surrounding cluster coordination seems broken. They all eventually get blocked on the same mutex in the `Coordinator` and effectively run sequentially. Particularly in clusters with larger node counts, this can lead to situations where lots of generic threads are needlessly spun up only to wait on the mutex. Since we at times also lock on the mutex in the coordinator on transport threads, it is particularly unfortunate when there are lots of generic threads waiting on it. => fix this by using a single-threaded pool for coordination work.
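To make the contention argument concrete, here is a plain-JDK sketch (not the actual `Coordinator` code; the class name and task bodies are illustrative): many generic-pool threads all serialize on one mutex anyway, while a single-threaded executor gives the same ordering without parking extra threads.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CoordinationPoolSketch {
    // Stand-in for the Coordinator mutex; all coordination work serializes on it anyway.
    private static final Object mutex = new Object();

    public static void main(String[] args) {
        // Before: submitting N coordination tasks to a large "generic" pool
        // spins up N threads that mostly just sit waiting on the mutex.
        ExecutorService generic = Executors.newCachedThreadPool();
        for (int i = 0; i < 100; i++) {
            generic.execute(() -> {
                synchronized (mutex) {
                    // coordination work
                }
            });
        }
        generic.shutdown();

        // After: a single-threaded pool runs the same tasks one at a time,
        // so no extra threads pile up waiting on the lock.
        ExecutorService coordination = Executors.newSingleThreadExecutor();
        for (int i = 0; i < 100; i++) {
            coordination.execute(() -> {
                synchronized (mutex) {
                    // coordination work
                }
            });
        }
        coordination.shutdown();
    }
}
```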
Pinging @elastic/es-distributed (Team:Distributed)
That sounds like a bug; we do nontrivial IO (writing the cluster state to disk) under this mutex. Let's fix this. I'm +1 on the general idea here, my thinking being that the …
@@ -219,6 +220,7 @@ public ThreadPool(final Settings settings, final ExecutorBuilder<?>... customBui
new ScalingExecutorBuilder(Names.FETCH_SHARD_STARTED, 1, 2 * allocatedProcessors, TimeValue.timeValueMinutes(5), false)
);
builders.put(Names.FORCE_MERGE, new FixedExecutorBuilder(settings, Names.FORCE_MERGE, 1, -1, false));
builders.put(Names.COORDINATION, new ScalingExecutorBuilder(Names.COORDINATION, 1, 1, TimeValue.timeValueSeconds(30), false));
It would be nice to not introduce a new scaling thread pool that doesn't handle rejections after shutdown. Given the limited area where this thread pool is used, I think we could handle rejection everywhere and set the `rejectAfterShutdown` flag to true here.
++ I made it a fixed pool of size 1 now. And fixed the one spot where this could have failed that I found.
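For illustration only, a plain-JDK sketch of what "handle rejection everywhere" means for a fixed pool of size 1 that rejects work after shutdown (the actual `FixedExecutorBuilder`/`rejectAfterShutdown` wiring is Elasticsearch-internal and not reproduced here):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

public class CoordinationRejectionSketch {
    public static void main(String[] args) {
        // Fixed pool of size 1, standing in for the new coordination pool.
        ExecutorService coordination = Executors.newSingleThreadExecutor();
        coordination.shutdown(); // simulate node shutdown

        try {
            coordination.execute(() -> System.out.println("coordination task"));
        } catch (RejectedExecutionException e) {
            // With rejection-after-shutdown semantics, every submission site must
            // decide what to do when the task will never run.
            System.out.println("rejected because the pool is shut down: " + e);
        }
    }
}
```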
Jenkins run elasticsearch-ci/part-2 (unrelated + known)
I looked into this when coming up with this PR. The spots I found were:
I should also add that the motivation for this PR is around some of the transport requests that fork to the generic pool logging slow-execution warnings. I can't for the life of me come up with a reason why they would be so slow, but I think there's a certain chance that it's the unfortunate combination of an extremely contended generic pool queue and just handling many of these requests at times.
Sorry, this turned out to be (sorta) fake news. We actually only run into this in tests because of the way we schedule some callbacks on shutdown. In production operation I couldn't see any contention here, so nothing to worry about. The contention on this lock, however, is pretty extreme in some scenarios, and running up tons of waiting GENERIC threads on a non-fair mutex seems broken.
Great, thanks for checking. Yes the lag detector should be non-critical.
If we're not already connected then opening a connection will do its stuff on the transport worker and ultimately complete the listener on the … Also, yikes, we acquire the mutex on the master update thread when reconfiguring (I opened #83682).
Just trying to understand this a bit more. AIUI this PR mostly removes contention at the consumer side of the threadpool queue? I suppose it'll fix contention between enqueuing coordinator and non-coordinator activities, but is that a problem? Maybe appending to a nonempty queue is cheaper than hunting for (or creating) a spare executor when the queue is empty? Still very surprising if it takes ≥5s to complete.
Not that I'm aware of. We had similar situations before where it would be nice to "know if we have to fork" but we just don't have the primitive for that. We could probably add it, but I'm not sure it's absolutely necessary.
Yea exactly, this is what this aims at. Whether this is a problem I can't say, but it's pretty much the last explanation for some issues that I can think of. What I have as far as clues go is this:
Hunting for … should be cheaper than appending with the … I figured we have a single …
No further comments really, I'm good to proceed, but I'd like to see if anyone raises anything we've not already covered in a team discussion too. If we do make this single-threaded then I think we may as well follow up by trying to remove the mutex entirely. Obviously that would need #83682 and the point about opening connections forking to …
LGTM
LGTM.
@Override
public void onRejection(Exception e) {
    logger.debug("threadpool was shut down", e);
I am not sure I follow this, is that not identical to the default `ThreadedRunnable` behavior (used by `schedule`), except not checking that the threadpool was indeed shut down?
Yea, the problem was that inside `ThreadedRunnable.run` we just call `executor.execute(runnable);`. If this is called with an `AbstractRunnable` then we will call `onRejection` on it instead of throwing the rejection, and the handling in `ThreadedRunnable` doesn't come into play => we need to manually catch this case here.
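A simplified, JDK-only analogue of the behaviour described above (the real `ThreadedRunnable`/`AbstractRunnable`/`EsThreadPoolExecutor` classes are Elasticsearch-internal; the names and wrapper below are illustrative): when the task carries its own rejection hook, the executor wrapper invokes that hook instead of letting the `RejectedExecutionException` propagate, so outer handling never sees the rejection.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

public class OnRejectionSketch {
    // Illustrative stand-in for AbstractRunnable: a task with its own rejection hook.
    interface RejectionAwareRunnable extends Runnable {
        void onRejection(Exception e);
    }

    // Illustrative stand-in for the executor-side handling: if the task knows how
    // to handle rejection, call its hook instead of rethrowing to the caller.
    static void execute(ExecutorService executor, Runnable task) {
        try {
            executor.execute(task);
        } catch (RejectedExecutionException e) {
            if (task instanceof RejectionAwareRunnable aware) {
                aware.onRejection(e); // outer callers never see the exception
            } else {
                throw e;
            }
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        pool.shutdown();

        execute(pool, new RejectionAwareRunnable() {
            @Override
            public void run() {}

            @Override
            public void onRejection(Exception e) {
                // mirrors the debug logging in the snippet above
                System.out.println("threadpool was shut down: " + e);
            }
        });
    }
}
```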
ah, thanks, I have been here before 🙂
Thanks all!
In elastic#83576 we moved cluster coordination activities to a threadpool which rejects actions on shutdown. elastic#84483 is a test failure caused by a missing rejection handler, which this commit addresses. Closes elastic#84483
@DaveCTurner @original-brownbear I wonder if we can move …
It's deliberate that those actions' request handlers run on …
In elastic#83576 we made it so that forking off the transport thread when handling a follower check might be rejected if we're shutting down. In that case we don't pass the rejection exception back out and instead rely on the transport connection's close to notify the caller. We also manipulate the transport channel directly which misses some exception cases. This commit adjusts the implementation to fix these things.
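A hedged, JDK-only sketch of the behaviour that commit message describes (the real follower-check handler and transport channel are Elasticsearch-internal; `sendResponse`/`sendErrorResponse` below are illustrative stand-ins): if forking off the transport thread is rejected during shutdown, fail the request explicitly rather than relying on the closed connection to notify the caller.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.function.Consumer;

public class FollowerCheckForkSketch {
    public static void main(String[] args) {
        ExecutorService coordination = Executors.newSingleThreadExecutor();
        coordination.shutdown(); // node is shutting down

        // Illustrative stand-ins for responding on a transport channel.
        Runnable sendResponse = () -> System.out.println("follower check ok");
        Consumer<Exception> sendErrorResponse = e -> System.out.println("follower check failed: " + e);

        try {
            coordination.execute(sendResponse);
        } catch (RejectedExecutionException e) {
            // Pass the rejection back to the caller explicitly rather than
            // leaving it to learn about the failure from the closed connection.
            sendErrorResponse.accept(e);
        }
    }
}
```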
Using the generic pool for all the activities surrounding cluster coordination seems broken. They all eventually get blocked on the same mutex in the `Coordinator` and effectively run sequentially. Particularly in clusters with larger node counts, this can lead to situations where lots of generic threads are needlessly spun up only to wait on the mutex. Since we at times also lock on the mutex in the coordinator on transport threads, it is particularly unfortunate when there are lots of generic threads waiting on it. => fix this by using a single-threaded pool for coordination work.
This also allows inspecting the queue size for these tasks for debugging purposes instead of having them hidden in the generic pool's queue.
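As a rough illustration of that last point (plain JDK, not the Elasticsearch thread-pool stats machinery): with a dedicated pool, the backlog of coordination tasks can be inspected directly.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class QueueSizeSketch {
    public static void main(String[] args) {
        // A dedicated single-threaded pool whose backlog can be observed directly,
        // instead of its tasks being mixed into the generic pool's queue.
        ThreadPoolExecutor coordination = new ThreadPoolExecutor(
            1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());

        for (int i = 0; i < 5; i++) {
            coordination.execute(() -> {
                try {
                    Thread.sleep(100); // simulate coordination work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // Debugging visibility: how many coordination tasks are waiting right now?
        System.out.println("pending coordination tasks: " + coordination.getQueue().size());
        coordination.shutdown();
    }
}
```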