Skip to content

Comments

[TENT] Fix Memory Management Issues#1486

Open
alogfans wants to merge 3 commits intokvcache-ai:mainfrom
alogfans:fix-tent-memory-leak
Open

[TENT] Fix Memory Management Issues#1486
alogfans wants to merge 3 commits intokvcache-ai:mainfrom
alogfans:fix-tent-memory-leak

Conversation

@alogfans
Copy link
Collaborator

@alogfans alogfans commented Feb 4, 2026

Description

Fixed two critical memory management issues in TENT (Transfer Engine):

  1. Slab memory pool static destruction order issue - Potential use-after-free during program exit
  2. MNNVL transport resource leaks - CUDA memory, handles, and file descriptors not properly released

Type of Change

  • Types
    • Bug fix
    • New feature
      • Transfer Engine
      • Mooncake Store
      • Mooncake EP
      • Integration
      • P2P Store
      • Python Wheel
    • Breaking change
    • CI/CD
    • Documentation update
    • Other

How Has This Been Tested?

Checklist

  • I have performed a self-review of my own code.
  • I have formatted my own code using ./scripts/code_format.sh before submitting.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @alogfans, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves critical memory management issues within the TENT (Transfer Engine) component. It specifically addresses a potential use-after-free scenario in the slab memory pool during program exit and rectifies resource leaks in the MNNVL transport mechanism by ensuring proper release of CUDA memory, allocation handles, and file descriptors.

Highlights

  • Slab Memory Pool Fix: Addressed a potential use-after-free issue in the slab memory pool during program exit by introducing an atomic 'destroyed' flag. This flag prevents further allocations or deallocations on a slab that is being destroyed, ensuring proper shutdown behavior.
  • MNNVL Transport Resource Cleanup: Implemented comprehensive cleanup for MNNVL transport resources. The uninstall() method now correctly unmaps CUDA memory, frees address reservations, releases CUDA memory handles, and closes imported file descriptors, preventing resource leaks.
  • MNNVL Resource Tracking: Enhanced the MnnvlEntry structure to track CUmemGenericAllocationHandle, a boolean indicating handle validity, and the imported_fd for each shared memory entry. This ensures that all necessary resources are properly identified and released during the uninstall process.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • mooncake-transfer-engine/tent/include/tent/runtime/slab.h
    • Added std::atomic<bool> destroyed_ to SlabBase to track the destruction state.
    • Modified SlabBase destructor to set the destroyed_ flag and acquire a lock during cleanup to prevent race conditions.
  • mooncake-transfer-engine/tent/include/tent/transport/mnnvl/mnnvl_transport.h
    • Extended MnnvlEntry struct to include handle (CUmemGenericAllocationHandle), has_handle (bool), and imported_fd (int) for improved resource tracking.
  • mooncake-transfer-engine/tent/src/runtime/slab.cpp
    • Implemented checks for the destroyed_ flag in SlabBase::allocate() and SlabBase::deallocate() to prevent operations on a slab that is being destroyed.
    • Corrected a variable shadowing issue within SlabBase::deallocate().
  • mooncake-transfer-engine/tent/src/transport/mnnvl/mnnvl_transport.cpp
    • Added <unistd.h> for the close() function.
    • Implemented comprehensive resource cleanup in MnnvlTransport::uninstall(), including unmapping CUDA memory, freeing address reservations, releasing memory handles, and closing file descriptors.
    • Updated MnnvlTransport::relocateSharedMemoryAddress() to correctly track and store imported_fd, handle, and has_handle in MnnvlEntry instances.
Activity
  • The author, alogfans, created this pull request to fix memory management issues.
  • The pull request is categorized as a bug fix.
  • The author has performed a self-review of the code.
  • The code has been formatted using ./scripts/code_format.sh.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request aims to resolve a use-after-free bug in the Slab memory pool and resource leaks in the MnnvlTransport. While addressing these, it unfortunately introduces new Use-After-Free vulnerabilities due to race conditions and stale thread-local caches, particularly in the destruction logic for SlabBase and the uninstallation process for MnnvlTransport, which do not safely handle concurrent access. Furthermore, the new resource handling logic in relocateSharedMemoryAddress can lead to leaks on certain error paths.

Comment on lines 143 to 149
cuMemUnmap((CUdeviceptr)entry.mnnvl_addr, entry.length);
}

// Release the address reservation
if (entry.mnnvl_addr) {
cuMemAddressFree((CUdeviceptr)entry.mnnvl_addr, entry.length);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-high high

This section introduces a Use-After-Free vulnerability. While the memory unmapping and address freeing in uninstall() correctly prevent resource leaks, they do not account for other threads that may have cached these addresses in their thread_local tl_relocate_map. When uninstall() unmaps the memory, any thread using a cached address from its local tl_relocate_map will encounter a Use-After-Free or illegal memory access. The thread_local caches must be invalidated or cleared when the transport is uninstalled. For better readability, the separate checks for entry.mnnvl_addr could also be combined.

@codecov-commenter
Copy link

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants