Skip to content

[FLINK-34644][runtime] Fix race condition in RestServerEndpoint shutdown#27708

Open
MartijnVisser wants to merge 2 commits intoapache:masterfrom
MartijnVisser:FLINK-34644-RestServerEndpointITCase
Open

[FLINK-34644][runtime] Fix race condition in RestServerEndpoint shutdown#27708
MartijnVisser wants to merge 2 commits intoapache:masterfrom
MartijnVisser:FLINK-34644-RestServerEndpointITCase

Conversation

@MartijnVisser
Copy link
Contributor

What is the purpose of the change

Fix the flaky test RestServerEndpointITCase.testShouldWaitForHandlersWhenClosing by addressing a race condition in AbstractRestHandler.respondToRequest() where the HTTP response write future was silently discarded, allowing the in-flight request tracker to deregister requests before the response was fully flushed to the network.

Brief change log

  • Changed thenAccept to thenCompose in AbstractRestHandler.respondToRequest() so that the returned future only completes after HandlerUtils.sendResponse() has fully flushed the HTTP response to the network

Verifying this change

This change is already covered by existing tests, such as RestServerEndpointITCase.testShouldWaitForHandlersWhenClosing which verifies that in-flight requests complete successfully during server shutdown. The test was previously flaky due to this race condition and now passes reliably.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

…own causing flaky test

Change thenAccept to thenCompose in AbstractRestHandler.respondToRequest()
so that the returned future only completes after the HTTP response is fully
flushed to the network. Previously, the future returned by
HandlerUtils.sendResponse() was silently discarded, allowing
InFlightRequestTracker.deregisterRequest() to fire before the response
write completed. Under load this let shutDownInternal() tear down the
Netty event loops while response bytes were still in flight, causing
ConnectionClosedException on the client side.
@MartijnVisser MartijnVisser requested a review from XComp February 27, 2026 16:47
@flinkbot
Copy link
Collaborator

flinkbot commented Feb 27, 2026

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

@rionmonster
Copy link
Contributor

rionmonster commented Feb 27, 2026

@MartijnVisser

I did a quick cursory search for other use of the thenAccept() calls to identify any other potential related problems and I think we may want to also update the AbstractSqlGatewayRestHandler as it seems like it may be subject to the same problem:

// AbstractSqlGatewayRestHandler:91-99
return response.thenAccept(
    resp ->
        HandlerUtils.sendResponse(
            ctx,
            httpRequest,
            resp,
            messageHeaders.getResponseStatusCode(),
            responseHeaders));

Thoughts? Otherwise the proposed fix seems reasonable! 👍

@MartijnVisser
Copy link
Contributor Author

MartijnVisser commented Feb 27, 2026

I did a quick cursory search for other use of the thenAccept() calls
Thoughts? Otherwise the proposed fix seems reasonable! 👍

Should have done that myself, good call-out. Let me fix that, thanks!

…n AbstractSqlGatewayRestHandler

Apply the same fix as AbstractRestHandler: change thenAccept to
thenCompose in respondToRequest() so the returned future only completes
after the HTTP response is fully flushed. This prevents the same race
condition where in-flight request deregistration could fire before the
response write completes.
@rionmonster
Copy link
Contributor

@MartijnVisser

All good! Approved! 👍

@github-actions github-actions bot added the community-reviewed PR has been reviewed by the community. label Feb 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-reviewed PR has been reviewed by the community.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants