
W-16941297: Scatter Gather timeout exception #14192

Merged: 27 commits merged into master on Feb 13, 2025

Conversation

anandkarandikar (Contributor) commented Jan 31, 2025

Ticket

W-16941297

Cause

When Scatter Gather times out, the routes that opened db connections leave those connections open. The cleanup code that the connector registers for when the event context completes is never invoked; event contexts need to complete in order for their cleanup jobs to be called.
When a timeout happens in a Reactor chain, the original Publisher is cancelled and the subscription switches to a new Publisher (the timeout's fallback publisher). The original Publisher in this context contains the assembled inner chain, so its cancellation prevents the AbstractPipeline from completing the context after all of its processors.
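
As a side note, here is a minimal standalone Reactor sketch (plain reactor-core, not Mule runtime code) of the cancellation behaviour described above: once the timeout fires, the subscription to the original publisher is cancelled and switches to the fallback, so anything that only runs on the original publisher's normal completion never happens.

import java.time.Duration;
import reactor.core.publisher.Mono;

public class TimeoutCancellationSketch {

  public static void main(String[] args) {
    // Stands in for the assembled inner chain of a route.
    Mono<String> original = Mono.delay(Duration.ofSeconds(5))
        .map(tick -> "route result")
        // Fires when the timeout cancels the subscription; completion-only logic is skipped.
        .doOnCancel(() -> System.out.println("original cancelled"))
        .doFinally(signal -> System.out.println("original terminated with: " + signal)); // prints CANCEL

    // On timeout the subscription switches to the fallback publisher.
    Mono<String> withTimeout = original.timeout(Duration.ofMillis(100), Mono.just("timeout fallback"));

    System.out.println(withTimeout.block()); // prints "timeout fallback" after ~100 ms
  }
}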

Other ideas

StreamingGhostBuster was considered as a way to handle this, since its intention is to clean up unclosed streams and their related CursorStreamProvider objects. However, in this situation the reference is a strong reference, so StreamingGhostBuster doesn't help.

Fix

  • When an item is emitted from the timeout publisher, we take care of completing the child context with an error (a condensed sketch of this follows this list).
  • The completion needs to be propagated recursively because there can be nested chains too.
  • Added a kill switch in case this breaks something.
  • Calling .error(...) without the Scheduler pool caused the Scatter Gather to wait until the longest SLEEP(n) completed.
  • To alleviate this behavior, this change uses timeoutScheduler, making the Scatter Gather time out as expected and present the composite routing exception messages to the user immediately, while the SELECT SLEEP(n) queries continue to execute. Once those complete, the .error(...) call is submitted to timeoutScheduler.
  • The timeoutScheduler created as a cpuLightScheduler is incapable of handling nested Scatter Gathers or a large number of routes. This fix was tested with 70 routes, almost all of them timing out; changing to an ioScheduler handled this scaling issue.
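
Condensed sketch of the resulting timeout handling. The forEachChild / submit / error lines mirror the change shown in the review diff below; the surrounding method name, signature, and import locations are assumptions for illustration, not the exact runtime code.

import org.mule.runtime.api.event.EventContext;
import org.mule.runtime.api.scheduler.Scheduler;
import org.mule.runtime.core.api.event.CoreEvent;
import org.mule.runtime.core.internal.event.AbstractEventContext;
import org.mule.runtime.core.privileged.exception.MessagingException;

// Invoked when the timeout publisher emits: complete every child event context with a
// MessagingException so connector cleanup callbacks (closing db connections, streams)
// finally run. Submitting to timeoutScheduler (now an ioScheduler) lets the Scatter
// Gather report the composite routing exception immediately instead of waiting for the
// slowest route.
private void completeChildContextsOnTimeout(Scheduler timeoutScheduler, CoreEvent event, Throwable cause) {
  EventContext context = event.getContext();
  if (context instanceof AbstractEventContext) {
    ((AbstractEventContext) context).forEachChild(ctx -> timeoutScheduler
        .submit(() -> ctx.error(new MessagingException(event, cause))));
  }
}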

Test Coverage

  • There are existing tests in org/mule/runtime/core/internal/routing/forkjoin that raise timeout events.
  • Also leveraging a timeout scenario with Scatter Gather using test extensions (marvel-extension) to mimic the delayed scenario. This is a similar approach to W-16941297: SG timeout issue mule-integration-tests#2634, but without needing an actual database in the picture, because we need to ensure the underlying streams are closed.

-    timeoutScheduler = schedulerService.cpuLightScheduler(SchedulerConfig.config()
+    timeoutScheduler = schedulerService.ioScheduler(SchedulerConfig.config()
Contributor

I would need a good justification for this change.
The tasks being submitted to that scheduler are better described as "CPU light" rather than "I/O intensive". If this change is needed for something to work, we definitely need to understand why, because initially it doesn't make sense.

Contributor Author

Would having an underlying db connection count as an I/O category?
We noticed that cpuLightScheduler doesn't work when the Scatter Gather contains a lot of routes.

Contributor

I don't think that matters. What matters is what you are doing in the tasks that you submit to the scheduler, for example whether the tasks require a lot of sleeping or blocking on I/O.
In this case I think the problem is that you are submitting just too many tasks, beyond the estimated capacity for the pool type.

Contributor Author

anandkarandikar commented Feb 6, 2025

That's possible. I've been scaling the SG with nested SGs, which may not really be the actual customer scenario. Given that the bound depends on the # of cores, it's possible my laptop wasn't able to handle that many CPU-bound tasks.
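
To make the capacity point concrete, here is a generic JDK sketch (plain java.util.concurrent Executors, not Mule's SchedulerService; the 200 ms sleep is an arbitrary stand-in for a completion task that briefly blocks). It shows how a pool bounded by core count processes a burst of ~70 tasks in waves, while an elastic, IO-style pool absorbs them at once.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PoolCapacitySketch {

  static long run(ExecutorService pool, int tasks) throws Exception {
    long start = System.nanoTime();
    List<Future<?>> futures = new ArrayList<>();
    for (int i = 0; i < tasks; i++) {
      futures.add(pool.submit(() -> {
        try {
          Thread.sleep(200);   // stands in for a completion task that briefly blocks
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }));
    }
    for (Future<?> f : futures) {
      f.get();                 // wait for all tasks, like waiting for all routes to complete
    }
    pool.shutdown();
    return (System.nanoTime() - start) / 1_000_000;
  }

  public static void main(String[] args) throws Exception {
    int cores = Runtime.getRuntime().availableProcessors();
    // ~70 routes timing out at once means ~70 completion tasks submitted together.
    System.out.println("cores-sized pool: " + run(Executors.newFixedThreadPool(cores), 70) + " ms");
    System.out.println("elastic pool:     " + run(Executors.newCachedThreadPool(), 70) + " ms");
  }
}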

@@ -146,7 +146,7 @@ private void handleTimeoutExceptionIfPresent(Scheduler timeoutScheduler,
       EventContext context = pair.getFirst().getContext();
       if (context instanceof AbstractEventContext) {
         ((AbstractEventContext) context).forEachChild(ctx -> timeoutScheduler
-            .submit(() -> ctx.error(error.get().getCause())));
+            .submit(() -> ctx.error(new MessagingException(pair.getFirst(), error.get().getCause()))));
Contributor Author

anandkarandikar commented Feb 3, 2025

Despite the flow working fine, there were failures in the log:

ERROR 2025-01-28 17:26:29,170 [pool-5-thread-2] [processor: ; event: ] org.mule.runtime.core.privileged.processor.MessageProcessors: Uncaught exception in childContextResponseHandler
java.lang.ClassCastException: class java.util.concurrent.TimeoutException cannot be cast to class org.mule.runtime.core.privileged.exception.MessagingException (java.util.concurrent.TimeoutException is in module java.base of loader 'bootstrap'; org.mule.runtime.core.privileged.exception.MessagingException is in module org.mule.runtime.core@4.9.0-20241025 of loader jdk.internal.loader.Loader @10cc327a)
	at org.mule.runtime.core@4.9.0-20241025/org.mule.runtime.core.privileged.processor.MessageProcessors.lambda$childContextResponseHandler$14(MessageProcessors.java:582) ~[mule-core-4.9.0-20241025.jar:?]
	at org.mule.runtime.core@4.9.0-20241025/org.mule.runtime.core.internal.event.AbstractEventContext.signalConsumerSilently(AbstractEventContext.java:310) ~[?:?]
	at org.mule.runtime.core@4.9.0-20241025/org.mule.runtime.core.internal.event.AbstractEventContext.receiveResponse(AbstractEventContext.java:210) ~[?:?]
	at org.mule.runtime.core@4.9.0-20241025/org.mule.runtime.core.internal.event.AbstractEventContext.error(AbstractEventContext.java:189) ~[?:?]
	at org.mule.runtime.core.components@4.9.0-20241025/org.mule.runtime.core.internal.routing.forkjoin.AbstractForkJoinStrategyFactory.lambda$handleTimeoutExceptionIfPresent$6(AbstractForkJoinStrategyFactory.java:173) ~[?:?]

Therefore, creating a MessagingException instance.

@anandkarandikar
Contributor Author

--validate


@anandkarandikar
Contributor Author

--validate


@Override
public Publisher<CoreEvent> decorateTimeoutPublisher(Publisher<CoreEvent> timeoutPublisher) {
// When the timeout happens, the subscription to the original publisher is cancelled, so the inner MessageProcessorChains
// never finishes and the child contexts are never completed, hence we have to complete them manually on timeout
Contributor

Suggested change:
-// never finishes and the child contexts are never completed, hence we have to complete them manually on timeout
+// never finish and the child contexts are never completed, hence we have to complete them manually on timeout

((AbstractEventContext) eventContext).forEachChild(allContexts::push);

while (!allContexts.isEmpty()) {
BaseEventContext ctx = allContexts.pop();
Contributor

It may be worth clarifying why we iterate in reverse order in this part
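
For readers following along, here is a standalone sketch of the stack-based traversal pattern in the snippet above. The Ctx type and names are invented purely for illustration and are not Mule types; it only shows that popping from the Deque visits contexts in the reverse of the order in which forEachChild discovered them, while still reaching nested chains, which is presumably the ordering this comment asks to have documented.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// "Ctx" is a stand-in for a child event context, used only for this illustration.
class Ctx {
  final String name;
  final List<Ctx> children = new ArrayList<>();
  Ctx(String name) { this.name = name; }
}

public class ChildTraversalSketch {

  public static void main(String[] args) {
    Ctx parent = new Ctx("scatter-gather");
    Ctx route1 = new Ctx("route 1");
    Ctx route2 = new Ctx("route 2");
    Ctx nested = new Ctx("nested chain inside route 2");
    route2.children.add(nested);
    parent.children.add(route1);
    parent.children.add(route2);

    // Push the direct children, then keep popping; because the Deque is used as a LIFO
    // stack, children come back in the reverse of the order they were pushed, and any
    // nested children discovered along the way are handled before moving on.
    Deque<Ctx> allContexts = new ArrayDeque<>();
    parent.children.forEach(allContexts::push);     // pushes route 1, then route 2

    while (!allContexts.isEmpty()) {
      Ctx ctx = allContexts.pop();                  // route 2 first (reverse order)
      ctx.children.forEach(allContexts::push);      // reach nested chains too
      System.out.println("completing " + ctx.name);
    }
    // Prints: route 2, nested chain inside route 2, route 1
  }
}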

asanguinetti merged commit b4b8679 into master on Feb 13, 2025
8 checks passed
asanguinetti deleted the fix/W-16941297 branch on February 13, 2025 at 16:45
asanguinetti pushed a commit that referenced this pull request Feb 13, 2025
Co-authored-by: Axel Sanguinetti <asanguinetti@salesforce.com>
(cherry picked from commit b4b8679)
asanguinetti pushed a commit that referenced this pull request Feb 13, 2025
Co-authored-by: Axel Sanguinetti <asanguinetti@salesforce.com>
(cherry picked from commit b4b8679)
asanguinetti pushed a commit that referenced this pull request Feb 14, 2025
Co-authored-by: Axel Sanguinetti <asanguinetti@salesforce.com>
(cherry picked from commit b4b8679)
asanguinetti pushed a commit that referenced this pull request Feb 14, 2025
Co-authored-by: Axel Sanguinetti <asanguinetti@salesforce.com>
(cherry picked from commit b4b8679)
asanguinetti added a commit that referenced this pull request Feb 14, 2025
(cherry picked from commit b4b8679)
Co-authored-by: Anand Karandikar <164932509+anandkarandikar@users.noreply.github.com>
asanguinetti pushed a commit that referenced this pull request Feb 15, 2025
Co-authored-by: Axel Sanguinetti <asanguinetti@salesforce.com>
(cherry picked from commit b4b8679)
(cherry picked from commit 615b721)
(cherry picked from commit 24d822d)
asanguinetti added a commit that referenced this pull request Feb 15, 2025
Co-authored-by: Axel Sanguinetti <asanguinetti@salesforce.com>
(cherry picked from commit b4b8679)
(cherry picked from commit 615b721)
(cherry picked from commit 24d822d)

Co-authored-by: Anand Karandikar <164932509+anandkarandikar@users.noreply.github.com>
asanguinetti added a commit that referenced this pull request Feb 17, 2025
(cherry picked from commit b4b8679)
(cherry picked from commit b236715)
(cherry picked from commit 3b38c65)
Co-authored-by: Anand Karandikar <164932509+anandkarandikar@users.noreply.github.com>