
[BUG-EXTERNAL] Excessive $NextInner objects (EH and SB) #28235

@anuchandy

Description


Root cause analysis

Cx deployed an application using SB sessions to 10 pods. A heap dump taken from one pod after six days of execution shows excessive allocation of the "reactor.core.publisher.NextProcessor$NextInner" type.

[Figure: heap histogram from the dump]

Out of the 1,065,540 "NextInner" instances, 1,065,504 retain "MonoCacheTime" and "MonoCacheTime$CoordinatedSubscriber" instances, which also appear at the top of the above histogram.

[Figure: inbound references to the MonoCacheTime instances]

Roughly one million instances of these types is a clear symptom of a leak.

These 1,065,504 "NextInner" objects are retained by one "reactor.core.publisher.NextProcessor" object.

[Figures: the single NextProcessor instance retaining the NextInner objects]

that "NextProcessor" is retained by the private variable shutdownSignalSink (type=Sink.One) in ReactorConnection.

[Figure: shutdownSignalSink retaining the NextProcessor]

While the library subscribes to this shutdownSignalSink in multiple places, the only place that exposes it through the intermediate MonoCacheTime operator (seen in the second figure) by applying the cache() operator is the ReactorConnection::getShutdownSignals() method:

private final Sinks.One<AmqpShutdownSignal> shutdownSignalSink = Sinks.one();

@Override
public Flux<AmqpShutdownSignal> getShutdownSignals() {
    return shutdownSignalSink.asMono().cache().flux();
}

Scanning the source code for references to getShutdownSignals(), the only place it is referenced and subscribed to is the ReactorReceiver constructor:

amqpConnection.getShutdownSignals().flatMap(signal -> {
    logger.verbose("Shutdown signal received.");
    return closeAsync("Connection shutdown.", null);
}).subscribe();

But the subscription is DISPOSED in the ReactorReceiver close API.
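
For illustration, a minimal sketch of that subscribe-then-dispose pattern (the class, field, and method names below are hypothetical, not the SDK's actual implementation):

import reactor.core.Disposable;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

// Hypothetical sketch: subscribe to the shutdown signals in the constructor,
// keep the Disposable, and dispose it when the receiver is closed.
final class ReceiverSketch implements AutoCloseable {
    private final Disposable shutdownSubscription;

    ReceiverSketch(Flux<String> shutdownSignals) {
        this.shutdownSubscription = shutdownSignals
            .flatMap(this::handleShutdown)
            .subscribe();
    }

    private Mono<Void> handleShutdown(String signal) {
        // React to the connection shutdown here (close links, release resources, ...).
        return Mono.empty();
    }

    @Override
    public void close() {
        // Cancels the upstream subscription so the receiver no longer listens for signals.
        shutdownSubscription.dispose();
    }
}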

Had there been a bug where ReactorReceiver was not closed, there would be ~1M instances of ReactorReceiver; in reality, there are only 250 of them.

[Figure: ReactorReceiver instance count in the heap dump]

This proves that ReactorReceiver instances are not being leaked and are disposed correctly. That led us to examine the implementation of Sinks.One, where we identified a problem: it continues to retain the subscriber even after disposal. A ticket, "Memory leak in SinkOneMulticast" (reactor/reactor-core#3001), has been opened in the reactor-core repo, and the fix is coming in reactor-core 3.4.17.
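
To make the mechanism concrete, here is a minimal standalone sketch (not the SDK code) of the allocation pattern described above; on reactor-core versions affected by reactor/reactor-core#3001, the disposed inner subscribers can remain reachable from the single sink:

import reactor.core.Disposable;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Sinks;

public final class SinkRetentionSketch {
    public static void main(String[] args) {
        // One long-lived sink, as in ReactorConnection.shutdownSignalSink.
        Sinks.One<String> shutdownSignalSink = Sinks.one();

        for (int i = 0; i < 1_000_000; i++) {
            // Mirrors getShutdownSignals(): a fresh cache()/MonoCacheTime per call.
            Flux<String> shutdownSignals = shutdownSignalSink.asMono().cache().flux();

            // Mirrors a ReactorReceiver lifetime: subscribe, then dispose on close.
            Disposable subscription = shutdownSignals.subscribe();
            subscription.dispose();
        }

        // On affected reactor-core versions, a heap dump taken here would show the
        // inner subscribers (NextInner / SinkOneMulticast inners) still retained by
        // the single sink instance; on 3.4.17+ they should be released on dispose.
    }
}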

Observation_1 (Cx action item)

You might have noticed that the reactor-core ticket mentions the type SinkOneMulticast, while the heap dump shows a different type, NextProcessor.

The reason is that the Cx application appears to have one or more dependencies pulling in a relatively old version of the reactor-core library (~7 months behind). The Azure Service Bus SDK is defined to use the more recent reactor-core 3.4.14.

Back in September 2021, the use of NextProcessor in Sinks.One was replaced with SinkOneMulticast in this commit.

Cx needs to analyze the dependencies and align the versions of shared libraries (e.g., reactor-core, reactor-netty, etc.) by upgrading them so that the application is ready to pick up reactor-core 3.4.17 once it is available.
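
As a quick sanity check after realigning the dependencies, a hypothetical diagnostic like the following (not part of the SDK) can confirm which reactor-core version actually ends up on the runtime classpath:

public final class ReactorVersionCheck {
    public static void main(String[] args) {
        // Reads the Implementation-Version from the reactor-core jar manifest, if present.
        Package reactorCore = reactor.core.publisher.Flux.class.getPackage();
        String version = reactorCore == null ? null : reactorCore.getImplementationVersion();
        // May print null if the jar's manifest does not carry an Implementation-Version entry.
        System.out.println("reactor-core on classpath: " + version);
    }
}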

Observation_2 (Cx action item)

The 1,065,504 instances of "NextProcessor$NextInner" mean that, over the period of 6 days, around ~1M ReactorReceiver objects were created and closed; that works out to roughly 125 ReactorReceiver instances created and disposed per minute (1,065,504 over ~8,640 minutes).

The only reason for such massive churn of these objects is that the consuming application is trying to acquire more sessions from the service than the producer application creates. Because of this, the SB service DETACHes those unnecessary receivers after a 1-minute timeout. This indicates that maxConcurrentSessions in the consumer application is too large; the Cx should tune this configuration according to the expected load.
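
For reference, a sketch of where that knob lives in the Service Bus session processor builder; the connection string, queue name, and the value 8 below are placeholders, and the right value depends on the expected number of concurrently active sessions:

import com.azure.messaging.servicebus.ServiceBusClientBuilder;
import com.azure.messaging.servicebus.ServiceBusProcessorClient;

public final class SessionTuningSketch {
    public static void main(String[] args) {
        ServiceBusProcessorClient processor = new ServiceBusClientBuilder()
            .connectionString("<connection-string>")    // placeholder
            .sessionProcessor()
            .queueName("<queue-name>")                  // placeholder
            // Keep this close to the number of sessions the producers actually create;
            // an oversized value leaves receivers idle until the service DETACHes them,
            // churning ReactorReceiver instances as described above.
            .maxConcurrentSessions(8)                   // placeholder value
            .processMessage(context -> context.complete())
            .processError(context -> { /* log / handle the error */ })
            .buildProcessorClient();

        processor.start();
    }
}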

Observation_3 (SDK action item)

The getShutdownSignals() API uses the cache() operator. We don't need cache() here; Sinks.One is capable of remembering the last signal and replaying it. While cache() doesn't directly contribute to the leak, removing it would save allocations.
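
A sketch of that simplification, assuming the replay behavior of Sinks.One described above is sufficient for all subscribers:

private final Sinks.One<AmqpShutdownSignal> shutdownSignalSink = Sinks.one();

@Override
public Flux<AmqpShutdownSignal> getShutdownSignals() {
    // Sinks.One already replays its result to late subscribers, so no cache() is needed.
    return shutdownSignalSink.asMono().flux();
}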

Also, azure-core should upgrade to reactor-core 3.4.17 once it is released.

Metadata

Labels

Event Hubs, Service Bus, pillar-reliability (The issue is related to reliability, one of our core engineering pillars; includes stress testing), tracking-external-issue (The issue is caused by an external problem (e.g. OS) - nothing we can do to fix it directly)
