Copilot AI commented Nov 13, 2025

Description

Netty leak detector reported ByteBuf leaks in CI pipeline. Investigation revealed two issues:

  1. ByteBuf close failures in StoreResponse were only logged at DEBUG level, making diagnosis impossible
  2. Actual leak source was in RetryContextOnDiagnosticTest where ByteBuf lifecycle was not properly managed

Changes

StoreResponse.java

  • Upgraded exception handling in StoreResponse constructor:
    • Changed catch from IOException to Throwable to capture Netty's IllegalReferenceCountException and similar runtime exceptions
    • Elevated log level from debug to warn with explicit message: "Failed to close content stream. This may cause a Netty ByteBuf leak."
    • Makes ByteBuf leak failures visible while maintaining backward compatibility (constructor still succeeds on close failure)

File: sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/directconnectivity/StoreResponse.java (lines 77-79)
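The pattern described for the StoreResponse constructor can be sketched roughly as follows. This is an illustration only, not the actual SDK code: the class name `ContentStreamCloser`, the `closeQuietly` helper, and the in-memory `WARNINGS` list (a stand-in for a real logger's warn channel) are assumptions made so the example has no dependencies. An `IllegalStateException` would be caught the same way as Netty's `IllegalReferenceCountException`, since both are unchecked.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class ContentStreamCloser {
    // Hypothetical stand-in for a logger's warn output, so the sketch stays dependency-free.
    static final List<String> WARNINGS = new ArrayList<>();

    // Closes the stream; returns false instead of throwing when close() fails,
    // so the caller (for example a constructor) still completes.
    static boolean closeQuietly(InputStream contentStream) {
        try {
            contentStream.close();
            return true;
        } catch (Throwable t) {
            // Throwable rather than IOException, so unchecked failures are reported too.
            WARNINGS.add("Failed to close content stream. "
                + "This may cause a Netty ByteBuf leak. Cause: " + t);
            return false;
        }
    }
}
```

Catching `Throwable` also swallows `Error` subclasses; whether that is acceptable depends on the surrounding code.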

RetryContextOnDiagnosticTest.java (Root Cause Fix)

  • Fixed ByteBuf leak in test methods:
    • backoffRetryUtilityExecuteRetry: Changed from Mono.just(new StoreResponse(...)) to Mono.fromCallable(() -> StoreResponseBuilder.create()...)
    • backoffRetryUtilityExecuteAsync: Applied same fix
    • Root cause: Mono.just() wraps a value that has already been constructed, so the StoreResponse (and its ByteBuf) was created once, eagerly, during mock setup instead of once per subscription
    • Solution: Mono.fromCallable() defers StoreResponse creation until subscription, ensuring proper ByteBuf management per-subscription
    • Used StoreResponseBuilder for cleaner, more maintainable test code

File: sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/RetryContextOnDiagnosticTest.java
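The difference between eager and deferred creation can be illustrated without Reactor. In the sketch below, `java.util.function.Supplier` is only a stand-in for `Mono`: `eager()` mimics `Mono.just(value)`, which wraps an existing value, and `deferred()` mimics `Mono.fromCallable(...)`, which invokes the callable on each subscription. `FakeResponse`, `CREATED`, and the method names are hypothetical and do not appear in the SDK or in the test.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class EagerVsDeferred {
    // Counts how many fake responses (each notionally owning a buffer) were constructed.
    static final AtomicInteger CREATED = new AtomicInteger();

    // Hypothetical stand-in for a StoreResponse that owns a pooled buffer.
    static final class FakeResponse {
        final int id = CREATED.incrementAndGet();
    }

    // Eager: the response is built here, during setup, and every caller shares it.
    static Supplier<FakeResponse> eager() {
        FakeResponse shared = new FakeResponse();
        return () -> shared;
    }

    // Deferred: nothing is built until get() is called; each call builds a fresh one.
    static Supplier<FakeResponse> deferred() {
        return FakeResponse::new;
    }
}
```

With the eager form, one response exists even if nobody ever asks for it, and retries all see the same instance. With the deferred form, each request gets its own instance.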

All SDK Contribution checklist:

  • The pull request does not introduce [breaking changes]
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

Fixes #47265

Original prompt

This section details the original issue you should resolve

<issue_title>[BUG]Possible Netty buffer leak exposed in test FullFidelityChangeFeedProcessorTest</issue_title>
<issue_description>Describe the bug
The FullFidelityChangeFeedProcessorTest logs in the CI pipeline (not reproducible locally) show a Netty buffer leak with the following call stack records:

NETTY LEAKS detected in test class: com.azure.cosmos.rx.changefeed.epkversion.FullFidelityChangeFeedProcessorTest

NETTY LEAK (traced) type=ByteBuf records=

Recent access records:

1:

io.netty.buffer.ByteBufInputStream.close(ByteBufInputStream.java:145)
com.azure.cosmos.implementation.directconnectivity.StoreResponse.<init>(StoreResponse.java:77)
com.azure.cosmos.implementation.RxGatewayStoreModel.unwrapToStoreResponse(RxGatewayStoreModel.java:226)
com.azure.cosmos.implementation.RxGatewayStoreModel.lambda$null$4(RxGatewayStoreModel.java:445)
reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.onNext(FluxMapFuseable.java:113)
reactor.core.publisher.FluxContextWrite$ContextWriteSubscriber.onNext(FluxContextWrite.java:107)
reactor.core.publisher.MonoPublishOn$PublishOnSubscriber.run(MonoPublishOn.java:181)
reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:68)
reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:28)
java.util.concurrent.FutureTask.run(FutureTask.java:266)
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:750)

2:

io.netty.buffer.AdvancedLeakAwareByteBuf.readBytes(AdvancedLeakAwareByteBuf.java:503)
io.netty.buffer.ByteBufInputStream.read(ByteBufInputStream.java:183)
java.io.InputStream.read(InputStream.java:101)
com.azure.cosmos.implementation.directconnectivity.JsonNodeStorePayload.fromJson(JsonNodeStorePayload.java:42)
com.azure.cosmos.implementation.directconnectivity.JsonNodeStorePayload.<init>(JsonNodeStorePayload.java:32)
com.azure.cosmos.implementation.directconnectivity.StoreResponse.<init>(StoreResponse.java:73)
com.azure.cosmos.implementation.RxGatewayStoreModel.unwrapToStoreResponse(RxGatewayStoreModel.java:226)
com.azure.cosmos.implementation.RxGatewayStoreModel.lambda$null$4(RxGatewayStoreModel.java:445)
reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.onNext(FluxMapFuseable.java:113)
reactor.core.publisher.FluxContextWrite$ContextWriteSubscriber.onNext(FluxContextWrite.java:107)
reactor.core.publisher.MonoPublishOn$PublishOnSubscriber.run(MonoPublishOn.java:181)
reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:68)
reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:28)
java.util.concurrent.FutureTask.run(FutureTask.java:266)
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:750)

3:

Hint: com.azure.cosmos.implementation.RxGatewayStoreModel@7f395262
com.azure.cosmos.implementation.RxGatewayStoreModel.unwrapToStoreResponse(RxGatewayStoreModel.java:220)
com.azure.cosmos.implementation.RxGatewayStoreModel.lambda$null$4(RxGatewayStoreModel.java:445)
reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.onNext(FluxMapFuseable.java:113)
reactor.core.publisher.FluxContextWrite$ContextWriteSubscriber.onNext(FluxContextWrite.java:107)
reactor.core.publisher.MonoPublishOn$PublishOnSubscriber.run(MonoPublishOn.java:181)
reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:68)
reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:28)
java.util.concurrent.FutureTask.run(FutureTask.java:266)
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
java.util.concurrent.ScheduledThreadPoolExecutor$S

The code branch is users/fabianm/NettyBufferLeakFixes

</issue_description>

<agent_instructions>Please analyze the Netty leak detector records and propose a fix. The relevant Cosmos files are under sdk/cosmos/azure-cosmos in the base branch provided.</agent_instructions>

Comments on the Issue (you are @copilot in this section)



Change log level from debug to warn and catch Throwable instead of just IOException to make potential ByteBuf leak issues more visible.

Co-authored-by: FabianMeiswinkel <19165014+FabianMeiswinkel@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix possible Netty buffer leak in FullFidelityChangeFeedProcessorTest" to "Fix Netty ByteBuf leak detection visibility in StoreResponse" on Nov 13, 2025
Copilot finished work on behalf of FabianMeiswinkel November 13, 2025 17:38
@FabianMeiswinkel

/azp run java - cosmos - ci

@azure-pipelines

Azure Pipelines could not run because the pipeline triggers exclude this branch/path.

@FabianMeiswinkel

@copilot - sure, that change can be useful, but it does not really explain the leak. Looking further into the logs, I realized that the actual leak happens in the previous test class, RetryContextOnDiagnosticTest, and is just reported later in FullFidelityChangeFeedProcessorTest. I updated the issue description with the correct Netty leak records. Please take another look and see whether you can find and fix the problem.


Copilot AI commented Nov 13, 2025

@copilot - sure that change can be useful but does not really explain the leak. Looking further into the logs I realized that the actual leak happens in the previous test class RetryContextOnDiagnosti...

I've found the actual leak in RetryContextOnDiagnosticTest. The issue is in lines 108-116 and 159-166 where ByteBuf objects are created but not properly released. The buffers are passed to StoreResponse via ByteBufInputStream with releaseOnClose=true, but the streams are never closed in the test, causing the ByteBuf leak. I'll fix this now.
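The release-on-close behavior mentioned here can be sketched with plain JDK types. Everything in the block is a hypothetical stand-in: `CountedBuffer` plays the role of a reference-counted Netty ByteBuf, and `ReleasingStream` plays the role of a ByteBufInputStream constructed with releaseOnClose=true. The point is only that the buffer's reference count drops when the stream is closed, so a stream that is never closed leaves the buffer unreleased.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.concurrent.atomic.AtomicInteger;

public class ReleaseOnCloseSketch {
    // Hypothetical stand-in for a reference-counted buffer.
    static final class CountedBuffer {
        final byte[] bytes;
        final AtomicInteger refCnt = new AtomicInteger(1);
        CountedBuffer(byte[] bytes) { this.bytes = bytes; }
        void release() { refCnt.decrementAndGet(); }
    }

    // Stream that releases its buffer the first time close() is called.
    static final class ReleasingStream extends InputStream {
        private final CountedBuffer buffer;
        private final InputStream delegate;
        private boolean closed;
        ReleasingStream(CountedBuffer buffer) {
            this.buffer = buffer;
            this.delegate = new ByteArrayInputStream(buffer.bytes);
        }
        @Override public int read() throws IOException { return delegate.read(); }
        @Override public void close() {
            if (!closed) { closed = true; buffer.release(); }
        }
    }

    // Reads every byte; try-with-resources guarantees close(), and so one release.
    static int countBytes(CountedBuffer buffer) {
        try (InputStream in = new ReleasingStream(buffer)) {
            int n = 0;
            while (in.read() != -1) { n++; }
            return n;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

After `countBytes` returns, the buffer's `refCnt` is 0. Creating a `ReleasingStream` and dropping it without calling `close()` leaves `refCnt` at 1, which is the kind of state a leak detector reports.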

Changed from Mono.just() to Mono.fromCallable() to defer StoreResponse creation, ensuring ByteBuf lifecycle is properly managed within each subscription rather than eagerly at mock setup time.

Co-authored-by: FabianMeiswinkel <19165014+FabianMeiswinkel@users.noreply.github.com>
Copilot AI changed the title from "Fix Netty ByteBuf leak detection visibility in StoreResponse" to "Fix Netty ByteBuf leaks in StoreResponse and RetryContextOnDiagnosticTest" on Nov 13, 2025
Copilot finished work on behalf of FabianMeiswinkel November 13, 2025 18:46

@FabianMeiswinkel FabianMeiswinkel left a comment


LGTM

@FabianMeiswinkel FabianMeiswinkel marked this pull request as ready for review November 13, 2025 18:55
@FabianMeiswinkel FabianMeiswinkel merged commit 06425eb into users/fabianm/NettyBufferLeakFixes Nov 13, 2025
3 checks passed
@FabianMeiswinkel FabianMeiswinkel deleted the copilot/fix-netty-buffer-leak branch November 13, 2025 18:55
FabianMeiswinkel added a commit that referenced this pull request Nov 20, 2025
…ent instance detection during test execution (#47211)

* Test changes to add leak detection

* Update Configs.java

* Update CosmosNettyLeakDetectorFactory.java

* Update sdk/cosmos/azure-cosmos-encryption/src/test/java/com/azure/cosmos/encryption/CosmosNettyLeakDetectorFactory.java

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update CosmosNettyLeakDetectorFactory.java

* Fixes

* Fixes

* Update CosmosNettyLeakDetectorFactory.java

* Update RxDocumentClientImpl.java

* Fixes

* Update CosmosNettyLeakDetectorFactory.java

* Fixes

* Fixes

* Iterating on tests

* Fixing build warning

* Fixing memory leak

* Reverting production changes

* Iterating on test tools

* Cleaning-up dummy QueryFeedRangeState properly

* Update test-resources.json

* Update sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/CosmosDiagnosticsE2ETest.java

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* [WIP] Fix Netty buffer and RxDocumentClientImpl leaks (#47213)

* Initial plan

* Improve JavaDoc phrasing in RxDocumentClientImpl

Co-authored-by: FabianMeiswinkel <19165014+FabianMeiswinkel@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: FabianMeiswinkel <19165014+FabianMeiswinkel@users.noreply.github.com>
Co-authored-by: Fabian Meiswinkel <fabianm@microsoft.com>

* NITs

* Move static block to class level in cosmos-encryption TestSuiteBase (#47216)

* Initial plan

* Move CosmosNettyLeakDetectorFactory.ingestIntoNetty() to class-level static block

Co-authored-by: FabianMeiswinkel <19165014+FabianMeiswinkel@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: FabianMeiswinkel <19165014+FabianMeiswinkel@users.noreply.github.com>

* Update SessionTest.java

* Update tests.yml

* Update CosmosNettyLeakDetectorFactory.java

* Test config

* Update CosmosNettyLeakDetectorFactory.java

* Updating TestNG

* Reverting TestNG to 7.9.0 (highest version still supporting Java8)

* Switching back to TestNG 7.5.1

* Enabling leak detection in unit tests

* Iterating on tests

* Update pom.xml

* Test changes (#47233)

* Update RntbdTransportClientTest.java

* Updating netty leak detection system properties

* Update CosmosNettyLeakDetectorFactory.java

* Test changes

* Prod memory leak fixes

* Test fixes

* Test fixes

* Users/fabianm/portfixes (#47252)

* Update WebExceptionRetryPolicy.java

* Update ThinClientStoreModel.java

* Test fixes

* Fix Netty ByteBuf leaks in StoreResponse and RetryContextOnDiagnosticTest (#47266)

* Initial plan

* Improve logging for ByteBufInputStream close failures

Change log level from debug to warn and catch Throwable instead of just IOException to make potential ByteBuf leak issues more visible.

Co-authored-by: FabianMeiswinkel <19165014+FabianMeiswinkel@users.noreply.github.com>

* Fix ByteBuf leak in RetryContextOnDiagnosticTest

Changed from Mono.just() to Mono.fromCallable() to defer StoreResponse creation, ensuring ByteBuf lifecycle is properly managed within each subscription rather than eagerly at mock setup time.

Co-authored-by: FabianMeiswinkel <19165014+FabianMeiswinkel@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: FabianMeiswinkel <19165014+FabianMeiswinkel@users.noreply.github.com>

* Update EncryptionAsyncApiCrudTest.java

* Fix ByteBuf memory leak in TcpServerMock request decoders (#47269)

* Initial plan

* Fix memory leak in ServerRntbdRequestDecoder and ServerRntbdContextRequestDecoder

When overriding channelRead() in ByteToMessageDecoder and bypassing the parent's
decode logic by calling context.fireChannelRead() directly, the ByteBuf reference
must be explicitly released to prevent memory leaks. Added ReferenceCountUtil.release()
in try-finally blocks to properly manage buffer lifecycle.

Co-authored-by: FabianMeiswinkel <19165014+FabianMeiswinkel@users.noreply.github.com>

* Replace ReferenceCountUtil.release with safeRelease

* Replace ReferenceCountUtil.release with safeRelease

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: FabianMeiswinkel <19165014+FabianMeiswinkel@users.noreply.github.com>
Co-authored-by: Fabian Meiswinkel <fabianm@microsoft.com>

* Update ThinClientStoreModel.java

* Fixing test issues

* Disable netty leak detection in RetrycontextOnDiagnosticTest

* Test and diagnostics improvements

* Test fixes and more breadcrumbs

* Test fixes

* Test fixes

* Test fixes

* Update OrderbyDocumentQueryTest.java

* Test fixes

* Reverting too noisy logs

* Test fixes

* Fixing POMs

* React to code review feedback

* Update pom.xml

* Addresses code review feedback

* Update sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/directconnectivity/rntbd/RntbdRequestDecoder.java

Co-authored-by: Annie Liang <64233642+xinlian12@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: FabianMeiswinkel <19165014+FabianMeiswinkel@users.noreply.github.com>
Co-authored-by: Annie Liang <64233642+xinlian12@users.noreply.github.com>