
Conversation

Contributor

Copilot AI commented Nov 14, 2025

Description

Fixed a Netty buffer leak in RxGatewayStoreModel.java where retained buffers were not released during cancellation, as identified in the OrderbyDocumentQueryTest leak detection logs.

Root Cause Analysis

After investigating the leak logs, the actual issue was identified in RxGatewayStoreModel.java, not in ReactorNettyClient.java as initially suspected.

Evidence from Leak Logs:

  • Buffer created at: reactor.netty.ByteBufFlux.lambda$aggregate$9(ByteBufFlux.java:276)
  • Hint: "RxGatewayStoreModel - touch retained buffer - refCnt: 2"
  • Buffer has refCnt=2 after retain() at RxGatewayStoreModel.java:434

The Problem:

  1. Line 425: httpResponse.body() returns aggregated ByteBuf with refCnt=1
  2. Line 434: bodyByteBuf.retain() increases refCnt to 2 for downstream work
  3. Line 444: First doOnDiscard registered on stream before map operation
  4. Line 478: unwrapToStoreResponse() takes ownership and releases when successful
  5. Bug: During cancellation after retain() but before unwrap, the first doOnDiscard releases once (refCnt→1) and the retained reference leaks (see the sketch below)
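
In shape, the pre-fix pipeline looks roughly like the following. This is a hedged, simplified sketch: unwrapToStoreResponse is passed in as a plain function and every name is a stand-in for the code paths referenced by line number above, not the actual SDK source.

```java
import java.util.function.Function;

import io.netty.buffer.ByteBuf;
import io.netty.util.ReferenceCountUtil;
import reactor.core.publisher.Mono;

// Hedged, simplified shape of the pre-fix pipeline; not the actual SDK source.
final class PreFixShapeSketch {
    static <T> Mono<T> toStoreResponse(
        Mono<ByteBuf> aggregatedBody,                       // aggregated ByteBuf, refCnt = 1 (line 425)
        Function<ByteBuf, Mono<T>> unwrapToStoreResponse) { // takes ownership, releases on success (line 478)

        return aggregatedBody
            // Pre-existing hook, registered before the map (line 444): it can only release
            // the single reference the buffer holds at this point in the chain.
            .doOnDiscard(ByteBuf.class, ReferenceCountUtil::safeRelease)
            // Second reference taken for downstream work (line 434): refCnt = 2.
            .map(ByteBuf::retain)
            // If the subscriber cancels between retain() and successful unwrapping,
            // the retained reference has no matching release and leaks.
            .flatMap(unwrapToStoreResponse);
    }
}
```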

Changes Made

RxGatewayStoreModel.java (Primary Fix)

  • Added a doOnDiscard after the map operation to handle cancellation/discard of the retained buffer before unwrapToStoreResponse takes ownership
  • Ensures both buffer references are properly released (see the sketch after this list):
    • Original aggregate buffer reference (handled by the first doOnDiscard)
    • Retained reference for downstream work (handled by the new doOnDiscard)
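
A hedged sketch of the fixed shape, assuming the new hook is registered directly after the map that calls retain(). This is illustrative only, not the literal diff:

```java
import java.util.function.Function;

import io.netty.buffer.ByteBuf;
import io.netty.util.ReferenceCountUtil;
import reactor.core.publisher.Mono;

// Hedged, simplified shape of the fixed pipeline; not the literal SDK change.
final class PostFixShapeSketch {
    static <T> Mono<T> toStoreResponse(
        Mono<ByteBuf> aggregatedBody,
        Function<ByteBuf, Mono<T>> unwrapToStoreResponse) {

        return aggregatedBody
            // Handles the original aggregate buffer reference, as before.
            .doOnDiscard(ByteBuf.class, ReferenceCountUtil::safeRelease)
            .map(ByteBuf::retain)                            // refCnt = 2
            // New hook after the map: releases the retained reference when the element is
            // discarded (for example on cancellation) before unwrapToStoreResponse takes ownership.
            .doOnDiscard(ByteBuf.class, buf -> {
                if (buf.refCnt() > 0) {
                    ReferenceCountUtil.safeRelease(buf);
                }
            })
            .flatMap(unwrapToStoreResponse);
    }
}
```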

ReactorNettyClient.java (Code Quality Improvements)

  • Separated leakDetectionDebuggingEnabled check (for touch/logging) from refCnt > 0 check (for release)
  • Changed .map() to .doOnNext() in releaseOnNotSubscribedResponse() for semantic correctness
  • These changes improve code clarity and avoid noisy logs, but they are not the primary leak fix (a sketch of the separated checks follows below)
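
A hedged sketch of what the separated checks might look like. The field and method signatures here are assumptions for illustration, not the exact ReactorNettyClient code:

```java
import io.netty.buffer.ByteBuf;
import io.netty.util.ReferenceCountUtil;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import reactor.core.publisher.Flux;

// Hedged sketch only; field and method signatures are assumptions, not the exact SDK code.
final class ReleaseLogicSketch {
    private static final Logger logger = LoggerFactory.getLogger(ReleaseLogicSketch.class);

    private final boolean leakDetectionDebuggingEnabled;

    ReleaseLogicSketch(boolean leakDetectionDebuggingEnabled) {
        this.leakDetectionDebuggingEnabled = leakDetectionDebuggingEnabled;
    }

    void releaseIfNeeded(ByteBuf buf) {
        // Touch/logging depend only on the debug flag ...
        if (leakDetectionDebuggingEnabled) {
            buf.touch("ReactorNettyClient - releasing not-subscribed response body");
            logger.info("Releasing not-subscribed response body - refCnt: {}", buf.refCnt());
        }
        // ... while the release depends only on whether a reference is still held.
        if (buf.refCnt() > 0) {
            ReferenceCountUtil.safeRelease(buf);
        }
    }

    // doOnNext expresses the release as a side effect instead of a map that returns a value
    // nobody consumes (hypothetical signature for releaseOnNotSubscribedResponse).
    Flux<ByteBuf> releaseOnNotSubscribedResponse(Flux<ByteBuf> responseBody) {
        return responseBody.doOnNext(this::releaseIfNeeded);
    }
}
```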

Testing

All SDK Contribution checklist:

  • The pull request does not introduce breaking changes.
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.
Original prompt

This section details the original issue you should resolve

<issue_title>[BUG]Investigate Netty buffer leak in OrderbyDocumentQueryTest</issue_title>
<issue_description>Describe the bug
Netty leak detected - records attached</issue_description>

<agent_instructions>Please investigate the attached netty leak records and propose changes to either debug why the leaks happen or fix them - the relevant code (including the pom.xml files for the netty version etc.) is located in sdk/cosmos/azure-cosmos</agent_instructions>

Comments on the Issue (you are @copilot in this section)

@FabianMeiswinkel [OrderbyDocumentQueryTest.txt](https://github.com/user-attachments/files/23556916/OrderbyDocumentQueryTest.txt)


FabianMeiswinkel and others added 30 commits November 7, 2025 20:49

  • …mos/encryption/CosmosNettyLeakDetectorFactory.java
  • …osmosDiagnosticsE2ETest.java
  • Initial plan
  • Improve JavaDoc phrasing in RxDocumentClientImpl
  • …47216)
  • Initial plan
  • Move CosmosNettyLeakDetectorFactory.ingestIntoNetty() to class-level static block
Copilot AI changed the title from "[WIP] Investigate Netty buffer leak in OrderbyDocumentQueryTest" to "Fix Netty buffer leaks in HTTP response handling" on Nov 14, 2025
Copilot finished work on behalf of FabianMeiswinkel November 14, 2025 23:26
@FabianMeiswinkel
Member

FabianMeiswinkel commented Nov 14, 2025

@copilot Please also consider these netty leak records for the same test in a different pipeline - they might give more insight and ways to validate any changes

OrderbyDocumentQueryTest_2.txt

@FabianMeiswinkel
Member

@copilot sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/http/ReactorNettyClient.java:361

safeRelease should still only happen when refCnt > 0 to avoid noisy logs - but touch and logger.info should have a separate if on leakDetectionEnabled.

This is not the real problem though, because the logs show that leakDetectionDebuggingEnabled == true in the run

Contributor Author

Copilot AI commented Nov 14, 2025

@copilot sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/http/ReactorNettyClient.java:361

safeRelease should still only happen when refCnt > 0 to avoid noisy logs - but touch an...

Fixed in commit b120443. Separated the conditionals:

  • touch and logger.info now only check leakDetectionDebuggingEnabled
  • safeRelease now only checks buf.refCnt() > 0

This avoids noisy logs while still performing proper leak detection when enabled.

Copilot AI changed the title from "Fix Netty buffer leaks in HTTP response handling" to "Improve Netty buffer release logic in HTTP response handling" on Nov 14, 2025
Copilot finished work on behalf of FabianMeiswinkel November 14, 2025 23:41
@FabianMeiswinkel
Member

@copilot Fine - but this is not addressing the issue described in the bug description - leakDetectionDebuggingEnabled was true there according to the logs

Contributor Author

Copilot AI commented Nov 14, 2025

@copilot Fine - but this is not addressing the issue described in the bug description - leakDetectionDebuggingEnabled was true there according to the logs

You're absolutely right. After reviewing both leak logs more carefully, the issue is NOT in ReactorNettyClient.java.

The leak logs show:

  1. Buffer created at reactor.netty.ByteBufFlux.lambda$aggregate$9
  2. Hint: "RxGatewayStoreModel - touch retained buffer - refCnt: 2"
  3. The buffer has refCnt=2 after retain() at RxGatewayStoreModel.java:434

The real leak is in RxGatewayStoreModel.java where:

  • Line 434: bodyByteBuf.retain() increases refCnt to 2
  • The buffer is passed to unwrapToStoreResponse() which takes ownership
  • But the buffer may not be properly released in all code paths, especially during cancellations

My changes to ReactorNettyClient addressed the wrong location because I misread the leak source. I need to investigate RxGatewayStoreModel's buffer lifecycle instead.
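
For reference, the placement rule that makes this kind of fix work can be shown in isolation: a doOnDiscard hook only applies to drops that happen upstream of where it is registered. The standalone snippet below is an illustration, not SDK code; it forces a drop with filter(b -> false) because that is deterministic to demonstrate, whereas in the reported leak the drop came from cancellation, but the placement rule is the same.

```java
import java.nio.charset.StandardCharsets;

import io.netty.buffer.ByteBuf;
import io.netty.buffer.Unpooled;
import io.netty.util.ReferenceCountUtil;
import reactor.core.publisher.Mono;

// Standalone illustration (not SDK code) of the doOnDiscard placement rule.
public final class DoOnDiscardPlacementDemo {
    public static void main(String[] args) {
        // Hook registered upstream of the drop point: it never sees the dropped buffer.
        ByteBuf leaked = Unpooled.copiedBuffer("payload", StandardCharsets.UTF_8);
        Mono.just(leaked)
            .doOnDiscard(ByteBuf.class, ReferenceCountUtil::safeRelease)
            .filter(b -> false)   // the element is dropped here, below the hook
            .block();
        System.out.println("hook above the drop point -> refCnt = " + leaked.refCnt());   // 1, leaked

        // Hook registered downstream of the drop point: the drop is covered and the buffer is released.
        ByteBuf released = Unpooled.copiedBuffer("payload", StandardCharsets.UTF_8);
        Mono.just(released)
            .filter(b -> false)   // the element is dropped here, above the hook
            .doOnDiscard(ByteBuf.class, ReferenceCountUtil::safeRelease)
            .block();
        System.out.println("hook below the drop point -> refCnt = " + released.refCnt()); // 0, released
    }
}
```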

Copilot AI changed the title from "Improve Netty buffer release logic in HTTP response handling" to "Fix Netty buffer leak in RxGatewayStoreModel during cancellation" on Nov 15, 2025
Copilot finished work on behalf of FabianMeiswinkel November 15, 2025 00:00
Base automatically changed from users/fabianm/NettyBufferLeakFixes to main November 20, 2025 06:41