[TENT] Fix Memory Management Issues by alogfans · Pull Request #1486 · kvcache-ai/Mooncake

alogfans · 2026-02-04T03:31:44Z

Description

Fixed two critical memory management issues in TENT (Transfer Engine):

Slab memory pool static destruction order issue - Potential use-after-free during program exit
MNNVL transport resource leaks - CUDA memory, handles, and file descriptors not properly released

Type of Change

Types
- Bug fix
- New feature
  - Transfer Engine
  - Mooncake Store
  - Mooncake EP
  - Integration
  - P2P Store
  - Python Wheel
- Breaking change
- CI/CD
- Documentation update
- Other

How Has This Been Tested?

Checklist

I have performed a self-review of my own code.
I have formatted my own code using ./scripts/code_format.sh before submitting.
I have updated the documentation.
I have added tests to prove my changes are effective.

gemini-code-assist · 2026-02-04T03:31:59Z

Summary of Changes

Hello @alogfans, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves critical memory management issues within the TENT (Transfer Engine) component. It specifically addresses a potential use-after-free scenario in the slab memory pool during program exit and rectifies resource leaks in the MNNVL transport mechanism by ensuring proper release of CUDA memory, allocation handles, and file descriptors.

Highlights

Slab Memory Pool Fix: Addressed a potential use-after-free issue in the slab memory pool during program exit by introducing an atomic 'destroyed' flag. This flag prevents further allocations or deallocations on a slab that is being destroyed, ensuring proper shutdown behavior.
MNNVL Transport Resource Cleanup: Implemented comprehensive cleanup for MNNVL transport resources. The uninstall() method now correctly unmaps CUDA memory, frees address reservations, releases CUDA memory handles, and closes imported file descriptors, preventing resource leaks.
MNNVL Resource Tracking: Enhanced the MnnvlEntry structure to track CUmemGenericAllocationHandle, a boolean indicating handle validity, and the imported_fd for each shared memory entry. This ensures that all necessary resources are properly identified and released during the uninstall process.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog

mooncake-transfer-engine/tent/include/tent/runtime/slab.h
- Added std::atomic<bool> destroyed_ to SlabBase to track the destruction state.
- Modified SlabBase destructor to set the destroyed_ flag and acquire a lock during cleanup to prevent race conditions.
mooncake-transfer-engine/tent/include/tent/transport/mnnvl/mnnvl_transport.h
- Extended MnnvlEntry struct to include handle (CUmemGenericAllocationHandle), has_handle (bool), and imported_fd (int) for improved resource tracking.
mooncake-transfer-engine/tent/src/runtime/slab.cpp
- Implemented checks for the destroyed_ flag in SlabBase::allocate() and SlabBase::deallocate() to prevent operations on a slab that is being destroyed.
- Corrected a variable shadowing issue within SlabBase::deallocate().
mooncake-transfer-engine/tent/src/transport/mnnvl/mnnvl_transport.cpp
- Added <unistd.h> for the close() function.
- Implemented comprehensive resource cleanup in MnnvlTransport::uninstall(), including unmapping CUDA memory, freeing address reservations, releasing memory handles, and closing file descriptors.
- Updated MnnvlTransport::relocateSharedMemoryAddress() to correctly track and store imported_fd, handle, and has_handle in MnnvlEntry instances.

Activity

The author, alogfans, created this pull request to fix memory management issues.
The pull request is categorized as a bug fix.
The author has performed a self-review of the code.
The code has been formatted using ./scripts/code_format.sh.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request aims to resolve a use-after-free bug in the Slab memory pool and resource leaks in the MnnvlTransport. While addressing these, it unfortunately introduces new Use-After-Free vulnerabilities due to race conditions and stale thread-local caches, particularly in the destruction logic for SlabBase and the uninstallation process for MnnvlTransport, which do not safely handle concurrent access. Furthermore, the new resource handling logic in relocateSharedMemoryAddress can lead to leaks on certain error paths.

mooncake-transfer-engine/tent/src/transport/mnnvl/mnnvl_transport.cpp

gemini-code-assist · 2026-02-04T03:34:28Z

mooncake-transfer-engine/tent/src/transport/mnnvl/mnnvl_transport.cpp

+                    cuMemUnmap((CUdeviceptr)entry.mnnvl_addr, entry.length);
+                }
+
+                // Release the address reservation
+                if (entry.mnnvl_addr) {
+                    cuMemAddressFree((CUdeviceptr)entry.mnnvl_addr, entry.length);
+                }


This section introduces a Use-After-Free vulnerability. While the memory unmapping and address freeing in uninstall() correctly prevent resource leaks, they do not account for other threads that may have cached these addresses in their thread_local tl_relocate_map. When uninstall() unmaps the memory, any thread using a cached address from its local tl_relocate_map will encounter a Use-After-Free or illegal memory access. The thread_local caches must be invalidated or cleared when the transport is uninstalled. For better readability, the separate checks for entry.mnnvl_addr could also be combined.

mooncake-transfer-engine/tent/src/runtime/slab.cpp

codecov-commenter · 2026-02-04T06:11:45Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

[TENT] Fix Memory Management Issues

5680871

alogfans requested review from chestnut-Q and doujiang24 as code owners February 4, 2026 03:31

gemini-code-assist bot reviewed Feb 4, 2026

View reviewed changes

github-actions bot added run-ci Transfer Engine labels Feb 4, 2026

alogfans added 2 commits February 4, 2026 04:00

Fix safe issues

975efe6

Fix TENT build problem

a3107bb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

[TENT] Fix Memory Management Issues#1486

[TENT] Fix Memory Management Issues#1486
alogfans wants to merge 3 commits intokvcache-ai:mainfrom
alogfans:fix-tent-memory-leak

alogfans commented Feb 4, 2026

Uh oh!

gemini-code-assist bot commented Feb 4, 2026

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

Uh oh!

gemini-code-assist bot Feb 4, 2026

Uh oh!

Uh oh!

Uh oh!

codecov-commenter commented Feb 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Comments

Conversation

alogfans commented Feb 4, 2026

Description

Type of Change

How Has This Been Tested?

Checklist

Uh oh!

gemini-code-assist bot commented Feb 4, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

gemini-code-assist bot Feb 4, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

codecov-commenter commented Feb 4, 2026

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants