Skip to content

Comments

[TENT] fix: Reduce unnecessary bandwidth consumption in TCP recvData operations#1513

Merged
alogfans merged 7 commits intokvcache-ai:mainfrom
00fish0:fix/tent-tcp-recvdata
Feb 9, 2026
Merged

[TENT] fix: Reduce unnecessary bandwidth consumption in TCP recvData operations#1513
alogfans merged 7 commits intokvcache-ai:mainfrom
00fish0:fix/tent-tcp-recvdata

Conversation

@00fish0
Copy link
Collaborator

@00fish0 00fish0 commented Feb 6, 2026

Description

Optimizes TCP transport in TENT backend by reducing unnecessary bandwidth consumption in recvData operations.

  • Reduced recvData RPC request size from 16 bytes + N to 16 bytes only, eliminating redundant pre-allocated buffer in requests. This cuts request-response bandwidth by ~50% for READ operations.
  • Fixed --xport_type parameter handling in TENT backend. Now properly disables unwanted transports and enables only the specified one (rdma/tcp/shm/iouring/gds/mnnvl).
  • Added missing getBaseAddr() stub for TENT backend (returns nullptr as TENT doesn't support CXL). This is required to maintain compatibility with the updated interface introduced in PR [Store][Feature] Add CXL storage for mooncake_store #1365 .

Type of Change

  • Types
    • Bug fix
    • New feature
      • Transfer Engine
      • Mooncake Store
      • Mooncake EP
      • Integration
      • P2P Store
      • Python Wheel
    • Breaking change
    • CI/CD
    • Documentation update
    • Other

How Has This Been Tested?

# Target
./tebench --xport_type=tcp --seg_type=DRAM --metadata_type=p2p

# Initiator
./tebench --target_seg_name=<addr> --xport_type=tcp --op_type=read

Before:

BlkSize (B)   Batch   BW (GB/S)     Avg Lat (us)  Avg Tx (us)   P99 Tx (us)   P999 Tx (us)  
----------------------------------------------------------------------------------------------------------------------------------------------------------------
I0207 02:50:11.476197 420608 segment_manager.cpp:44] Opened segment #1: 192.168.200.17:15822
4096          1       0.029985      136.6         135.1         188.0         343.4         
8192          1       0.059502      137.7         136.2         172.0         294.0         
16384         1       0.106295      154.1         152.5         180.0         242.6         
32768         1       0.185946      176.2         174.6         201.0         302.2         
65536         1       0.253678      258.3         256.7         325.0         503.3         
131072        1       0.392522      333.9         332.5         464.3         615.1         
262144        1       0.403777      649.2         647.5         734.0         997.0         
524288        1       0.485275      1080.4        1078.7        1199.7        2575.4        
1048576       1       0.446561      2348.1        2346.4        3317.7        45421.1       
2097152       1       0.522427      4014.2        4012.4        4274.8        46224.0       
4194304       1       0.565539      7416.5        7414.0        8616.6        25377.0       
8388608       1       0.538270      15584.4       15580.3       20437.4       57343.8       
16777216      1       0.530047      31652.3       31647.1       73797.7       74095.8       
33554432      1       0.216957      154659.5      154651.3      168331.8      171388.4      
67108864      1       0.229261      292718.4      292709.7      310151.2      311053.7  

After:

BlkSize (B)   Batch   BW (GB/S)     Avg Lat (us)  Avg Tx (us)   P99 Tx (us)   P999 Tx (us)  
----------------------------------------------------------------------------------------------------------------------------------------------------------------
I0207 02:24:33.354920 407885 segment_manager.cpp:44] Opened segment #1: 192.168.200.17:15314
4096          1       0.030621      133.8         132.2         149.0         238.2         
8192          1       0.058706      139.5         137.9         160.0         216.3         
16384         1       0.114198      143.5         141.9         161.0         314.2         
32768         1       0.186659      175.5         173.8         228.0         299.9         
65536         1       0.327960      199.8         198.1         237.0         376.8         
131072        1       0.490405      267.3         265.6         302.0         385.5         
262144        1       0.634434      413.2         411.5         474.0         583.5         
524288        1       0.878559      596.8         595.0         690.0         847.6         
1048576       1       1.064387      985.1         983.2         1313.2        2287.4        
2097152       1       1.041997      2012.6        2010.3        2544.5        3992.5        
4194304       1       1.236025      3393.4        3389.9        4625.9        8471.5        
8388608       1       1.176762      7128.5        7124.1        9491.0        12068.1       
16777216      1       1.301293      12892.7       12886.6       15687.8       18889.1       
33554432      1       0.251798      133259.1      133250.3      136573.3      136767.4      
67108864      1       0.292828      229174.7      229166.4      249890.1      249926.0      

Checklist

  • I have performed a self-review of my own code.
  • I have formatted my own code using ./scripts/code_format.sh before submitting.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

@00fish0 00fish0 requested a review from alogfans as a code owner February 6, 2026 19:12
Copilot AI review requested due to automatic review settings February 6, 2026 19:12
@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @00fish0, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request primarily focuses on optimizing the TCP transport within the TENT backend to reduce unnecessary bandwidth consumption during recvData operations. This is achieved by streamlining the RPC request size. Additionally, it includes a crucial fix for the --xport_type parameter handling, ensuring correct transport configuration, and adds a necessary stub for CXL-related functionality.

Highlights

  • Bandwidth Optimization: Reduced the recvData RPC request size from 16 bytes + N to 16 bytes by eliminating redundant pre-allocated buffers, significantly cutting request-response bandwidth for READ operations.
  • Transport Configuration Fix: Corrected the handling of the --xport_type parameter in the TENT backend to properly disable all transports by default and enable only the specified one (rdma/tcp/shm/iouring/gds/mnnvl).
  • CXL Support Stub: Introduced a getBaseAddr() stub for the TENT backend, which returns nullptr as TENT does not support CXL.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • mooncake-transfer-engine/benchmark/tent_backend.cpp
    • Implemented logic to dynamically enable only the transport specified by --xport_type (e.g., rdma, tcp, shm) while disabling all others.
    • Corrected a minor formatting issue in the getCudaDeviceNumaID function.
  • mooncake-transfer-engine/src/transfer_engine.cpp
    • Added a getBaseAddr() method to the TransferEngine class, which returns nullptr when use_tent_ is true, indicating no CXL base address support for the TENT backend.
  • mooncake-transfer-engine/tent/src/runtime/control_plane.cpp
    • In ControlClient::sendData, added a check for non-empty RPC responses to return RpcServiceError.
    • In ControlClient::recvData, optimized the RPC request by removing the length component from its size, reducing it to sizeof(XferDataDesc). Also added a check to ensure the response size matches the expected length, returning an error if not.
    • In ControlService::onSendData, added an error response if the target address is not found in a registered buffer.
    • In ControlService::onRecvData, moved the response.resize(length) call inside the buffer finding conditional, ensuring memory is allocated only when the buffer is valid.
Activity
  • No specific activity (comments, reviews, or progress updates) has been recorded for this pull request yet.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@00fish0 00fish0 changed the title [TENT] R、educing unnecessary bandwidth consumption in recvData operations [TENT] Reducing unnecessary bandwidth consumption in TCP recvData operations Feb 6, 2026
@00fish0 00fish0 changed the title [TENT] Reducing unnecessary bandwidth consumption in TCP recvData operations [TENT] fix: Reduce unnecessary bandwidth consumption in TCP recvData operations Feb 6, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR optimizes the TENT backend’s control-plane data path by shrinking recvData RPC requests (removing redundant preallocated payload) and tightening error signaling, along with benchmark transport-selection fixes and a TENT stub for getBaseAddr().

Changes:

  • Reduce recvData RPC request size to only the descriptor header and adjust server/client behavior accordingly.
  • Add error propagation for sendData/recvData when the target address is not in a registered buffer.
  • Update TENT benchmark config to honor --xport_type by disabling/enabling transports; add TransferEngine::getBaseAddr() stub for TENT.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
mooncake-transfer-engine/tent/src/runtime/control_plane.cpp Shrinks recvData request payload; adds/adjusts RPC error handling for Send/Recv.
mooncake-transfer-engine/src/transfer_engine.cpp Adds getBaseAddr() behavior for TENT mode (returns nullptr).
mooncake-transfer-engine/benchmark/tent_backend.cpp Changes --xport_type handling to disable all transports and enable only the chosen one; minor loop formatting tweak.
Comments suppressed due to low confidence (1)

mooncake-transfer-engine/tent/src/runtime/control_plane.cpp:257

  • onRecvData() leaves response empty when the address range isn't registered, so the client can only infer failure via a size mismatch. For consistency with onSendData() and better diagnosability, set an explicit error message in the else path (and have the client surface it when present).
    if (local_desc->findBuffer(peer_mem_addr, length)) {
        response.resize(length);
        Platform::getLoader().copy(response.data(), (void*)peer_mem_addr,
                                   length);
    }
}

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +72 to +74
if (response.size() != length)
return Status::RpcServiceError(
"RecvData failed: target address not in registered buffer");
Copy link

Copilot AI Feb 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

recvData() treats any response size mismatch as "target address not in registered buffer" and discards the actual server response. This can misreport other failure modes (e.g., server-side error string, partial response) and makes debugging harder. Consider returning RpcServiceError(response) when !response.empty() and otherwise include expected vs actual sizes in the error message.

Suggested change
if (response.size() != length)
return Status::RpcServiceError(
"RecvData failed: target address not in registered buffer");
if (response.size() != length) {
if (!response.empty()) {
// Server likely returned an explicit error message instead of raw data.
return Status::RpcServiceError(response);
}
return Status::RpcServiceError(
"RecvData failed: response size mismatch (expected " +
std::to_string(length) + " bytes, got " +
std::to_string(response.size()) + " bytes)");
}

Copilot uses AI. Check for mistakes.
Comment on lines 58 to 71
// Enable only the specified transport
if (XferBenchConfig::xport_type == "rdma") {
config->set("transports/rdma/enable", true);
} else if (XferBenchConfig::xport_type == "tcp") {
config->set("transports/tcp/enable", true);
} else if (XferBenchConfig::xport_type == "shm") {
config->set("transports/shm/enable", true);
} else if (XferBenchConfig::xport_type == "iouring") {
config->set("transports/io_uring/enable", true);
} else if (XferBenchConfig::xport_type == "gds") {
config->set("transports/gds/enable", true);
} else if (XferBenchConfig::xport_type == "mnnvl") {
config->set("transports/mnnvl/enable", true);
}
Copy link

Copilot AI Feb 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If XferBenchConfig::xport_type is set to an unexpected value, this code disables all transports and enables none, which will make the benchmark fail in a non-obvious way. Add an else branch that logs/returns an error (or falls back to defaults) when the value is not one of the supported strings.

Copilot uses AI. Check for mistakes.
Comment on lines 50 to 56
// Disable all transports by default
config->set("transports/rdma/enable", false);
config->set("transports/tcp/enable", false);
config->set("transports/shm/enable", false);
config->set("transports/io_uring/enable", false);
config->set("transports/gds/enable", false);
config->set("transports/mnnvl/enable", false);
Copy link

Copilot AI Feb 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When xport_type is specified, this block disables only a subset of transports. Others that are enabled-by-default in the TENT transport loader (e.g. transports/nvlink/enable and transports/ascend_direct/enable) can still be loaded, so it may not actually "enable only the specified transport". Consider explicitly disabling all known transport enable keys before turning on the selected one.

Copilot uses AI. Check for mistakes.
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces an important optimization to the recvData operation by reducing RPC request size, which significantly cuts down bandwidth consumption for READ operations. It also includes a fix for the --xport_type parameter handling and adds a getBaseAddr() stub for the TENT backend. However, critical vulnerabilities have been identified in the RPC handlers onSendData and onRecvData. These functions lack sufficient validation of user-supplied memory addresses and lengths, potentially leading to arbitrary memory read/write and Denial of Service (DoS) vulnerabilities due to integer overflows in memory range checks and missing buffer size validations. Robust input validation and overflow checks are highly recommended before merging these changes.

Comment on lines 59 to 70
if (XferBenchConfig::xport_type == "rdma") {
config->set("transports/rdma/enable", true);
} else if (XferBenchConfig::xport_type == "tcp") {
config->set("transports/tcp/enable", true);
} else if (XferBenchConfig::xport_type == "shm") {
config->set("transports/shm/enable", true);
} else if (XferBenchConfig::xport_type == "iouring") {
config->set("transports/io_uring/enable", true);
} else if (XferBenchConfig::xport_type == "gds") {
config->set("transports/gds/enable", true);
} else if (XferBenchConfig::xport_type == "mnnvl") {
config->set("transports/mnnvl/enable", true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The series of if-else if statements for enabling transport types can become cumbersome as more transport types are added. Consider refactoring this into a std::map<std::string, bool> where keys are transport names and values indicate enablement, or using a switch statement if XferBenchConfig::xport_type can be mapped to an enum. This would improve maintainability and scalability of the configuration logic.

        std::unordered_map<std::string, bool> transport_enable_map;
        transport_enable_map["rdma"] = false;
        transport_enable_map["tcp"] = false;
        transport_enable_map["shm"] = false;
        transport_enable_map["iouring"] = false;
        transport_enable_map["gds"] = false;
        transport_enable_map["mnnvl"] = false;

        if (transport_enable_map.count(XferBenchConfig::xport_type)) {
            transport_enable_map[XferBenchConfig::xport_type] = true;
        }

        for (const auto& entry : transport_enable_map) {
            config->set("transports/" + entry.first + "/enable", entry.second);
        }

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following this comment is better if extending transports.

fatSheep added 2 commits February 7, 2026 03:20
- Add request size validation in onSendData to prevent buffer over-read
- Add length upper bound check in onRecvData to prevent DoS via memory exhaustion
- Add integer overflow checks in findBuffer for memory range validation

Fixes security vulnerabilities identified by Gemini Code Assist review
@codecov-commenter
Copy link

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Copy link
Collaborator

@alogfans alogfans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Others LGTM.

@alogfans alogfans merged commit fd2da4f into kvcache-ai:main Feb 9, 2026
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants