Fork remote-cluster response handling #97922

Merged

Conversation

DaveCTurner
Contributor

@DaveCTurner DaveCTurner commented Jul 25, 2023

Today all responses to remote-cluster requests are deserialized and
handled on the transport worker thread. Some of these responses can be
sizeable, so with this commit we add the facility for callers to specify
a different executor to handle this work. It also adjusts several
callers to use more appropriate threads, including:

  • responses from CCR-related admin actions are handled on ccr
  • responses from field caps actions are handled on search_coordination
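
As a rough sketch of the pattern (class and method names here are hypothetical, not the actual Elasticsearch API), the transport worker hands the raw bytes to a caller-chosen executor so that deserializing a sizeable response never blocks the worker thread:

    // Hypothetical sketch only, not the real Elasticsearch API.
    import java.util.concurrent.Executor;
    import java.util.function.Consumer;
    import java.util.function.Function;

    public class ForkingResponseHandlerSketch<T> {
        private final Executor responseExecutor; // e.g. the ccr or search_coordination pool
        private final Consumer<T> onResponse;

        public ForkingResponseHandlerSketch(Executor responseExecutor, Consumer<T> onResponse) {
            this.responseExecutor = responseExecutor;
            this.onResponse = onResponse;
        }

        // Invoked on the transport worker thread; forks immediately so the
        // potentially large response is deserialized and handled elsewhere.
        public void handleRawResponse(byte[] rawBytes, Function<byte[], T> deserializer) {
            responseExecutor.execute(() -> onResponse.accept(deserializer.apply(rawBytes)));
        }
    }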

@DaveCTurner DaveCTurner added >non-issue :Distributed Coordination/Network Http and internode communication implementations v8.10.0 labels Jul 25, 2023
@elasticsearchmachine elasticsearchmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Jul 25, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@DaveCTurner
Contributor Author

DaveCTurner commented Jul 25, 2023

One thing to note here is that we are forking to pools that have bounded queues. Semantically that's ok: we force-execute these tasks (see org.elasticsearch.transport.ForkingResponseHandlerRunnable), but we may see more rejections of other work in borderline-overloaded clusters after this change. Then again, that still seems preferable to spamming the transport worker threads.

Edit: also, I'm upgrading this to >bug because of how CCR requests an awful lot of index metadata when looking for new indices to follow, which in a big cluster is going to cause problems without this change.
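
To illustrate the force-execution point, here is a toy sketch of the semantics, assuming a hand-rolled soft queue limit (this is not the actual EsThreadPoolExecutor or ForkingResponseHandlerRunnable implementation): forked response handlers are always accepted, but they still consume capacity, so non-forced work hits the rejection path sooner.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.RejectedExecutionException;
    import java.util.concurrent.atomic.AtomicInteger;

    class SoftBoundedExecutorSketch {
        private final ExecutorService delegate = Executors.newFixedThreadPool(4);
        private final AtomicInteger pending = new AtomicInteger();
        private final int softLimit = 100;

        void execute(Runnable task, boolean forceExecution) {
            // Forced tasks (the forked response handlers) are never rejected here,
            // but they still occupy capacity, so other, non-forced tasks may be
            // rejected more often in a borderline-overloaded cluster.
            if (forceExecution == false && pending.get() >= softLimit) {
                throw new RejectedExecutionException("queue full");
            }
            pending.incrementAndGet();
            delegate.execute(() -> {
                try {
                    task.run();
                } finally {
                    pending.decrementAndGet();
                }
            });
        }
    }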

@elasticsearchmachine
Collaborator

Hi @DaveCTurner, I've created a changelog YAML for you.

Contributor

@henningandersen henningandersen left a comment


Looks good. I'm in doubt about one of the thread pool choices, though.

@@ -466,6 +466,7 @@ static void ccsRemoteReduce(
ActionListener<SearchResponse> listener,
BiConsumer<SearchRequest, ActionListener<SearchResponse>> localSearchConsumer
) {
final var remoteClientResponseExecutor = threadPool.executor(ThreadPool.Names.SEARCH_COORDINATION);
Contributor


I wonder if we should use the SEARCH pool instead, since that is the pool other work here is spawned on.

I think the coordination pool is mainly for the node level can-match/fieldcaps actions.

Might not matter a whole lot, but I'd rather induce any rejections due to this on the SEARCH thread pool than the SEARCH_COORDINATION pool.

Contributor Author

@DaveCTurner DaveCTurner Jul 27, 2023


I will revert this to SAME to make progress here, and raise it with the search folks as a follow-up. I think what we're doing here is more like coordination work, so enqueueing it behind some IO-heavy search activity might do bad things for performance, but there are definitely trade-offs either way.

Edit: I pushed 801500b and opened #97997
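
For readers unfamiliar with SAME: the revert means the response goes back to being handled directly on the calling thread rather than on a forked pool, roughly as below (a sketch only; the exact change is in 801500b).

    final var remoteClientResponseExecutor = threadPool.executor(ThreadPool.Names.SAME); // handled on the calling thread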

@@ -373,10 +373,6 @@ public Optional<EngineFactory> getEngineFactory(final IndexSettings indexSetting

@SuppressWarnings("HiddenField")
public List<ExecutorBuilder<?>> getExecutorBuilders(Settings settings) {
if (enabled == false) {
Contributor


Not that it matters, but was this change necessary, or did you just find the "optimization" of not having the extra thread pool unnecessary?

Contributor Author

@DaveCTurner DaveCTurner Jul 27, 2023


We apparently do some CCR-related things even if CCR is disabled, or at least there are tests which now fail if the CCR thread pool's existence is contingent on whether CCR is enabled. It seemed simplest to just create the thread pool either way rather than pick all that apart.
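
The shape of the change being discussed is roughly this (a sketch, not the exact CCR plugin code; the builder helper is hypothetical):

    @SuppressWarnings("HiddenField")
    public List<ExecutorBuilder<?>> getExecutorBuilders(Settings settings) {
        // previously: if (enabled == false) { return List.of(); }
        // the "ccr" executor builder is now registered unconditionally, since some
        // CCR code paths (and tests) expect the pool to exist even when CCR is disabled
        return List.of(ccrThreadPoolBuilder(settings)); // hypothetical helper
    }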

Contributor

@henningandersen henningandersen left a comment


LGTM.

Contributor

@original-brownbear original-brownbear left a comment


LGTM2 nice :)

@DaveCTurner DaveCTurner merged commit f4e3113 into elastic:main Jul 27, 2023
@DaveCTurner DaveCTurner deleted the 2023/07/25/RemoteClusterClient-forking branch July 27, 2023 09:26
@DaveCTurner
Contributor Author

Thanks both :)

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Aug 2, 2023
Today we deserialize the chunks received when creating a follower on the
CCR pool (prior to elastic#97922 this was the transport thread) and then
manually fork off to the GENERIC pool to write the chunk to disk. It's
simpler just to do all this on the GENERIC pool to start with, so this
commit does that.
elasticsearchmachine pushed a commit that referenced this pull request Aug 2, 2023
Today we deserialize the chunks received when creating a follower on the
CCR pool (prior to #97922 this was the transport thread) and then
manually fork off to the GENERIC pool to write the chunk to disk. It's
simpler just to do all this on the GENERIC pool to start with, so this
commit does that.
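
Illustratively, that simplification amounts to handing the response handler the GENERIC executor up front, so deserializing the chunk and writing it to disk happen on the same thread with no second fork (reusing the hypothetical handler sketched earlier; helper and type names are made up):

    var handler = new ForkingResponseHandlerSketch<Chunk>(
        threadPool.generic(),                 // deserialize on GENERIC ...
        chunk -> writeChunkToDisk(chunk)      // ... and write to disk on the same thread
    );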
@DaveCTurner DaveCTurner restored the 2023/07/25/RemoteClusterClient-forking branch June 17, 2024 06:17
Labels
>bug :Distributed Coordination/Network Http and internode communication implementations Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. v8.10.0