[TE] Support TCP fallback in EFA build and improve EFA documentation #1523

Merged
stmatengss merged 2 commits into kvcache-ai:main from whn09:efa-tcp-cuda-support
Feb 11, 2026
Conversation


@whn09 whn09 commented Feb 9, 2026

Description

When building Mooncake with USE_EFA=ON, the TransferEngine is created with auto_discover=false to prevent automatic RDMA transport installation (since ibverbs QP creation fails on EFA devices with error 95 EOPNOTSUPP). However, this also prevents TCP transport from being installed when mooncake_protocol is set to "tcp", causing "Local segment descriptor not found" errors.

This PR adds an else branch in the #ifdef USE_EFA block of transfer_engine_py.cpp to explicitly install TCP transport when the requested protocol is not "efa".

Additionally, the documentation is updated to:

  • Add -DUSE_EFA to the Advanced Compile Options in build.md and clarify that USE_CUDA defaults to OFF and is required for GPU memory transfers (including over TCP)
  • Add EFA as a supported protocol in supported-protocols.md
  • Add -DUSE_CUDA=ON to the EFA build command in efa_transport.md and document that it is required for GPU memory (e.g., KV cache in vLLM)

Type of Change

  • Types
    • Bug fix
    • New feature
      • Transfer Engine
      • Mooncake Store
      • Mooncake EP
      • Integration
      • P2P Store
      • Python Wheel
    • Breaking change
    • CI/CD
    • Documentation update
    • Other

How Has This Been Tested?

Tested on two AWS p5en.48xlarge instances (8× H200 GPUs each) running vLLM disaggregated prefill-decode serving with Kimi-K2.5 model:

  1. EFA protocol (mooncake_protocol: "efa"): 128 requests, 0 failures.
  2. TCP fallback (mooncake_protocol: "tcp") with USE_CUDA=ON and USE_EFA=ON: 128 requests, 0 failures.
    Without this fix, TCP fallback fails with "Local segment descriptor not found" (no transport installed), or with "Bad address" (EFAULT) when USE_CUDA=OFF, because GPU memory is not handled.

Checklist

  • I have performed a self-review of my own code.
  • I have formatted my own code using ./scripts/code_format.sh before submitting.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

When building with USE_EFA=ON, auto_discover is disabled to prevent
RDMA transport installation (QP creation fails on EFA devices). This
means TCP transport is also not installed automatically. Add explicit
TCP transport installation for non-EFA protocols in the EFA build path.

Documentation changes:
- build.md: Add USE_EFA option and clarify USE_CUDA default/purpose
- supported-protocols.md: Add EFA as a supported protocol
- efa_transport.md: Add USE_CUDA=ON to build command, document GPU
  memory requirement

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@gemini-code-assist

Summary of Changes

Hello @whn09, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the robustness and clarity of Mooncake's networking capabilities, particularly in AWS EFA environments. It fixes a critical bug preventing TCP fallback when EFA is enabled, ensuring that users can reliably switch between EFA and TCP protocols. Concurrently, it enhances the documentation to provide comprehensive instructions and explanations for building with EFA and managing GPU memory transfers with CUDA, thereby streamlining the user experience and preventing common configuration pitfalls.

Highlights

  • Bug Fix: TCP Fallback in EFA Builds: Resolved an issue where TCP transport failed to install in Mooncake builds with USE_EFA=ON when mooncake_protocol was set to "tcp". This previously led to "Local segment descriptor not found" errors because auto_discover is disabled for EFA builds to prevent RDMA installation failures.
  • Explicit TCP Transport Installation: Modified transfer_engine_py.cpp to explicitly install TCP transport when USE_EFA is enabled and the requested protocol is not EFA, ensuring proper fallback functionality.
  • Enhanced Documentation for EFA and CUDA: Updated build.md, supported-protocols.md, and efa_transport.md to provide clearer guidance on EFA build options, USE_CUDA requirements for GPU memory transfers (even with TCP), and detailed information about EFA as a supported protocol.


Changelog
  • docs/source/design/transfer-engine/efa_transport.md
    • Updated the EFA build command to explicitly include -DUSE_CUDA=ON.
    • Added a note clarifying that -DUSE_CUDA=ON is essential for GPU memory transfers, especially when TCP is used as a fallback, to prevent "Bad address" errors.
  • docs/source/getting_started/build.md
    • Revised the description for -DUSE_CUDA to highlight its role in enabling GPU memory support for various transports, including GPU-aware TCP, and specified its default OFF state.
    • Added -DUSE_EFA to the advanced compile options, detailing its purpose and linking to the EFA Transport documentation.
  • docs/source/getting_started/supported-protocols.md
    • Introduced EFA (AWS Elastic Fabric Adapter) as a primary supported protocol, including its description, use cases, configuration, build requirements, advantages, and limitations.
    • Added EFA to the protocol recommendation table for AWS Cloud environments.
  • mooncake-integration/transfer_engine/transfer_engine_py.cpp
    • Implemented an else branch within the USE_EFA conditional block to explicitly install TCP transport when the protocol is not EFA, addressing the issue of missing TCP transport when auto_discover is disabled.
Activity
  • The author whn09 initiated this pull request.
  • The changes include a bug fix, a new feature for the Transfer Engine, and documentation updates.
  • The solution was tested on two AWS p5en.48xlarge instances, successfully verifying both EFA protocol and TCP fallback with USE_CUDA=ON.
  • The author performed a self-review, formatted the code, updated documentation, and confirmed that tests were added to validate the changes.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively addresses the issue of TCP fallback in EFA-enabled builds by explicitly installing the TCP transport when auto_discover is disabled. The accompanying documentation updates are excellent, providing clear instructions and necessary context for building with EFA and CUDA support. I have one suggestion in transfer_engine_py.cpp to make the fallback logic more generic, which would improve the code's flexibility for handling other non-EFA protocols in the future.

Comment on lines +208 to 220

    } else {
        // For non-EFA protocols (e.g. TCP), manually install TCP transport
        // since auto_discover is disabled to prevent RDMA installation
        // (RDMA QP creation fails on EFA devices).
        LOG(INFO)
            << "Installing TCP transport (auto_discover disabled in EFA build)";
        auto transport = engine_->installTransport("tcp", nullptr);
        if (!transport) {
            LOG(ERROR) << "Failed to install TCP transport";
            return -1;
        }
        LOG(INFO) << "TCP transport installed successfully";
    }
Severity: medium

The else block is intended for non-EFA protocols when building with USE_EFA=ON. However, it currently hardcodes the installation of the "tcp" transport, regardless of the protocol parameter passed to the function. This limits the flexibility of using other transports (besides "efa" and "tcp") in an EFA-enabled build.

To make this more generic and align with the function's parameters, you should use the proto variable to install the requested transport.

    } else {
        // For non-EFA protocols, manually install the requested transport
        // since auto_discover is disabled to prevent RDMA installation
        // (RDMA QP creation fails on EFA devices).
        LOG(INFO)
            << "Installing " << proto << " transport (auto_discover disabled in EFA build)";
        auto transport = engine_->installTransport(proto, nullptr);
        if (!transport) {
            LOG(ERROR) << "Failed to install " << proto << " transport";
            return -1;
        }
        LOG(INFO) << proto << " transport installed successfully";
    }

@codecov-commenter
Copy link


Codecov Report

✅ All modified and coverable lines are covered by tests.


@stmatengss stmatengss merged commit 3cdbbdc into kvcache-ai:main Feb 11, 2026
16 checks passed
qiuweit7 pushed a commit to openanolis/Mooncake that referenced this pull request Feb 13, 2026
…vcache-ai#1523)

stmatengss added a commit that referenced this pull request Feb 24, 2026
* [Store] feat: CXL storage full features.

* fix(ci): resolve cxl test failure and code format

* fix(ci): add CXL protocol support and fix code format issues

* [Store] feat: CXL storage full features, reset and rm extern/pybind

* [TE] Support TCP fallback in EFA build and improve EFA documentation (#1523)


* [Store] Optimize BucketStorageBackend for reduced lock contention and add delete safety (#1456)

* fix(ci): resolve cxl test failure and code format

* [Store] Add Local Cache Mechanism for Mooncake Store Client (#1226)

* feat(Store): add local hot cache for client

* feat(Store): add client local hot cache log to show performance

* fix: local hot cache initialize bug

* fix(Store): Mooncake put slice is max 16MB, so make local hot cache block 16MB

* feat(Store): move local hot cache initialization to Client::Create

* feat(Store): local hot cache remove unused small block implementation

* feat(Store): add client local hot cache unit test

* fix(Store): modify client local hot cache suit with v0.3.7

* feat(Store): change local hot cache unit test

* fix: initialize local hot cache with negative value

* feat: use in-process master and metadata for local hot cache unit test.

* feat: update local hot cache to one replica one slice version

* fix: local hot cache unit test use in process master service

* fix: code style fix

* fix: fix dirty read when client wants to read a previously hit hot block but the hot block is modified by incoming put actions

* fix: local hot cache unit test use in process master service

* fix: code format fix

* fix: fix comment problems for

* feat: add local hot asynchronous queue size limit

* fix: local hot cache task involves the block so that there is no memcpy operation when inserting local hot cache

* fix: code check fix

* fix: update block in_use prop to reference count

---------

Co-authored-by: shichangzhang064 <zhangshichang@h-partners.com>

* fix(ci): add CXL protocol support and fix code format issues

* fix: address comments from code review

* fix(ci): resolve cxl test failure

* fix(ci): resolve ci error

---------

Co-authored-by: 王鹤男 <wanghenan09@gmail.com>
Co-authored-by: whn09 <whn09@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Teng Ma <teng-ma@linux.alibaba.com>
Co-authored-by: Mahesh Bapatu <153306023+maheshrbapatu@users.noreply.github.com>
Co-authored-by: Shichang Zhang <77728761+Shichang-Zhang@users.noreply.github.com>
Co-authored-by: shichangzhang064 <zhangshichang@h-partners.com>