[TENT] Improve tebench: GPU selection, graceful interruption, and build fixes#1537

Merged
alogfans merged 9 commits into kvcache-ai:main from staryxchen:dev/tebench
Feb 12, 2026
Conversation

@staryxchen
Collaborator

Description

This PR includes several improvements to the tebench benchmark tool:

Features

  • Support specifying a single GPU via --local_gpu_id / --target_gpu_id, or -1 for all GPUs.
  • Add --rpc_server_port flag to configure the RPC server port (0 = auto-select).
  • Support graceful interruption: when a signal is received, the benchmark loop exits cleanly instead of hanging.
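The graceful-interruption behavior described above can be sketched as follows. This is a minimal standalone sketch, not tebench's actual code: `g_stop`, `handleSignal`, and `runLoop` are illustrative names, and the real runners return an error code through their own interfaces.

```cpp
#include <csignal>

// Hypothetical sketch of "runners return an error when stopped":
// a signal handler sets a flag, and the benchmark loop checks it
// each iteration so it can exit cleanly instead of hanging.
static volatile std::sig_atomic_t g_stop = 0;

extern "C" void handleSignal(int) { g_stop = 1; }

// Returns non-zero when interrupted, so the caller can break out of
// the outer benchmark loops early.
int runLoop(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        if (g_stop) return 1;  // early exit on interruption
        // ... one benchmark iteration would run here ...
    }
    return 0;  // completed normally
}
```

The caller installs the handler once (e.g. `std::signal(SIGINT, handleSignal)`) and treats any non-zero return as a request to stop the remaining block-size/thread-count combinations.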

Refactor

  • Unify DRAM and VRAM buffer allocation logic in TENTBenchRunner::allocateBuffers(), eliminating duplicated code across the two paths.

Build

  • Set RPATH ($ORIGIN/../lib) on the tebench executable so it can find libasio.so at runtime.
  • Add missing <csignal> include in rpc.h.
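The RPATH change can be sketched in CMake roughly as follows. This is a hypothetical fragment based on the description above; the exact form in the repository's CMakeLists.txt may differ.

```cmake
# Make the tebench binary look for shared libraries (e.g. libasio.so)
# relative to its own location at runtime, so it works without
# LD_LIBRARY_PATH tweaks. $ORIGIN expands to the executable's directory.
set_target_properties(tebench PROPERTIES
    BUILD_RPATH   "$ORIGIN/../lib"
    INSTALL_RPATH "$ORIGIN/../lib"
)
```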

Type of Change

  • Types
    • Bug fix
    • New feature
      • Transfer Engine
      • Mooncake Store
      • Mooncake EP
      • Integration
      • P2P Store
      • Python Wheel
    • Breaking change
    • CI/CD
    • Documentation update
    • Other

How Has This Been Tested?

Checklist

  • I have performed a self-review of my own code.
  • I have formatted my own code using ./scripts/code_format.sh before submitting.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

- Include csignal for signal handling in RPC module

Signed-off-by: staryxchen <staryxchen@tencent.com>
- Configure tebench target to use relative RPATH for runtime library resolution
- Ensure libasio.so can be found at runtime from the ../lib directory

Signed-off-by: staryxchen <staryxchen@tencent.com>
- Update runners to return error when stopped
- Add early exit logic in main loop

Signed-off-by: staryxchen <staryxchen@tencent.com>
- Update flags to accept -1 for all GPUs.
- Adjust buffer allocation and address calculation for specific GPU selection.

Signed-off-by: staryxchen <staryxchen@tencent.com>
- Extract common logic for device prefix and buffer count
- Merge allocation and registration loops to reduce code duplication
- Improve variable const-correctness

Signed-off-by: staryxchen <staryxchen@tencent.com>
- Add `rpc_server_port` field to `XferBenchConfig`
- Load port value from command line flags

Signed-off-by: staryxchen <staryxchen@tencent.com>
@gemini-code-assist
Contributor

Summary of Changes

Hello @staryxchen, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the tebench benchmark tool by introducing greater flexibility in GPU selection and RPC server configuration, improving its robustness with graceful interruption handling, and streamlining its internal buffer allocation mechanism. These changes aim to make the benchmark more user-friendly, reliable, and maintainable, while also addressing build-related issues for better deployment.

Highlights

  • Enhanced GPU Selection: Added options to specify a single GPU via --local_gpu_id and --target_gpu_id, or to use all available GPUs by setting the ID to -1.
  • Configurable RPC Server Port: Introduced a new flag, --rpc_server_port, allowing users to specify the RPC server port, with 0 enabling auto-selection.
  • Graceful Interruption Handling: Implemented logic to ensure the benchmark loop exits cleanly upon receiving a signal, preventing the benchmark from hanging.
  • Unified Buffer Allocation Logic: Refactored TENTBenchRunner::allocateBuffers() to consolidate DRAM and VRAM buffer allocation into a single, unified code path, reducing duplication.
  • Build System Improvements: Configured RPATH for the tebench executable to correctly locate libasio.so at runtime and added a missing <csignal> include in rpc.h.


Changelog
  • mooncake-transfer-engine/benchmark/CMakeLists.txt
    • Corrected if statement formatting for CUDAToolkit_FOUND.
    • Added RPATH configuration for the tebench executable to ensure libasio.so is found at runtime.
  • mooncake-transfer-engine/benchmark/main.cpp
    • Modified processBatchSizes function to return an int for interruption status.
    • Adjusted GPU ID calculation to support -1 for all GPUs and respect local_gpu_id/target_gpu_id offsets.
    • Implemented interruption checks in the main benchmark loops to allow graceful exit.
  • mooncake-transfer-engine/benchmark/te_backend.cpp
    • Updated VRAM buffer allocation logic to respect local_gpu_id for single GPU selection.
    • Modified runInitiatorTasks to return an integer indicating if an interruption occurred.
  • mooncake-transfer-engine/benchmark/tent_backend.cpp
    • Integrated rpc_server_port into the configuration loading.
    • Refactored allocateBuffers to unify DRAM and VRAM buffer allocation logic, reducing redundancy.
    • Modified runInitiatorTasks to return an integer indicating if an interruption occurred.
  • mooncake-transfer-engine/benchmark/utils.cpp
    • Updated descriptions for local_gpu_id and target_gpu_id flags to mention -1 for all GPUs.
    • Added a new rpc_server_port flag and its corresponding static member and loading logic.
  • mooncake-transfer-engine/benchmark/utils.h
    • Declared rpc_server_port as a static member in XferBenchConfig.
  • mooncake-transfer-engine/tent/include/tent/rpc/rpc.h
    • Included <csignal> header for signal handling.
Activity
  • No specific activity (comments, reviews, progress updates) has been provided for this pull request.

@gemini-code-assist gemini-code-assist bot left a comment
Code Review

This pull request introduces several valuable improvements to the tebench tool, including GPU selection, graceful interruption handling, and build fixes. The refactoring to unify buffer allocation logic in TENTBenchRunner::allocateBuffers is a notable improvement to code quality, reducing duplication. However, the new GPU selection logic lacks sufficient validation for negative ID values (other than the special -1 value), which can lead to out-of-bounds memory access and potential crashes or undefined behavior. It is recommended to validate that local_gpu_id and target_gpu_id are either -1 or non-negative before they are used as offsets or device indices. Additionally, there is a minor suggestion regarding error handling to improve robustness.

Comment on lines 47 to 53:

```cpp
auto target_gpu_offset = XferBenchConfig::target_gpu_id == -1
                             ? 0
                             : XferBenchConfig::target_gpu_id;
uint64_t local_addr = runner.getLocalBufferBase(
    local_gpu_offset + thread_id, max_block_size, max_batch_size);
uint64_t target_addr = runner.getTargetBufferBase(
    target_gpu_offset + thread_id, max_block_size, max_batch_size);
```
security-high

The logic for target_gpu_offset allows XferBenchConfig::target_gpu_id to be a negative value other than -1. When such a value is provided, target_gpu_offset + thread_id can be negative. This value is passed to runner.getTargetBufferBase, which in both TEBenchRunner and TENTBenchRunner implementations uses it as an index into an array after a modulo operation. In C++, the result of the modulo operator % with a negative operand is negative, resulting in an out-of-bounds array access to info_->buffers or pinned_buffer_list_.
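The sign behavior the reviewer describes is easy to reproduce: in C++, the result of `a % b` takes the sign of `a`, so a negative offset survives the modulo and becomes a negative array index. A minimal illustration (`bufferIndex` is a hypothetical stand-in for the indexing done in the runners, not an actual tebench function):

```cpp
// Demonstrates the hazard: with a negative gpu_offset (e.g. a user
// passing --target_gpu_id=-2), the modulo yields a negative index,
// which would be out of bounds for info_->buffers or
// pinned_buffer_list_.
int bufferIndex(int gpu_offset, int thread_id, int buffer_count) {
    return (gpu_offset + thread_id) % buffer_count;
}
```

For example, `bufferIndex(-2, 0, 4)` evaluates to `-2`, not `2`, which is why the IDs must be validated before use.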

Comment on lines 101 to 107:

```cpp
if (XferBenchConfig::local_gpu_id != -1) {
    start_idx = XferBenchConfig::local_gpu_id;
    num_buffers = 1;
    LOG_ASSERT(start_idx < gpu_count)
        << "local_gpu_id " << start_idx << " exceeds device count "
        << gpu_count;
}
```
security-medium

Similar to the issue in te_backend.cpp, local_gpu_id is not validated to be non-negative. While TENTBenchRunner uses CHECK_FAIL to handle allocation errors, providing a negative GPU ID will still cause the benchmark to fail and exit unexpectedly due to an invalid device location string (e.g., 'cuda:-2') being passed to the transfer engine.
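A defensive version along the lines both comments suggest might look like this. This is a hypothetical sketch, not tebench's actual code: `validatedGpuOffset` is an illustrative helper, and the clamp with `std::max` mirrors the approach adopted in the follow-up commit.

```cpp
#include <algorithm>
#include <stdexcept>
#include <string>

// Accept -1 ("all GPUs") or a non-negative id; reject anything else.
// -1 maps to offset 0, so the caller never sees a negative index.
int validatedGpuOffset(int gpu_id) {
    if (gpu_id < -1) {
        throw std::invalid_argument("gpu_id must be -1 or >= 0, got " +
                                    std::to_string(gpu_id));
    }
    return std::max(0, gpu_id);
}
```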

@staryxchen
Collaborator Author

I am testing the performance of TENT, and the current test results cannot reach the theoretical bandwidth limit. The original TE was able to achieve this, see #675
@alogfans Do you have any idea?

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
- Replace ternary operators with `std::max` for offset calculation
- Update assertions to check for non-negative GPU IDs

Signed-off-by: staryxchen <staryxchen@tencent.com>
Signed-off-by: staryxchen <staryxchen@tencent.com>
@codecov-commenter

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.


Collaborator

@alogfans alogfans left a comment

LGTM

@alogfans alogfans merged commit 1eec5a5 into kvcache-ai:main Feb 12, 2026
16 checks passed
@alogfans
Collaborator

@staryxchen The original test benchmark uses GPU VRAM (#675), while the tebench may not use it by default.

@staryxchen
Collaborator Author

staryxchen commented Feb 12, 2026

@staryxchen The original test benchmark uses GPU VRAM (#675), while the tebench may not use it by default.

@alogfans I specified seg_type as VRAM; the specific commands used and the results are as follows:

```shell
# target
./tebench --backend=tent --seg_name=$IP --seg_type=VRAM --max_block_size=$((1*1024*1024)) --local_gpu_id=-1 --start_block_size=$((1*1024*1024)) --max_num_threads=128 --start_num_threads=128 --total_buffer_size=$((4*1024*1024*1024)) --start_batch_size=16 --max_batch_size=16 --rpc_server_port=16568
# initiator
./tebench --backend=tent --target_seg_name=$IP:16568 --seg_type=VRAM --max_block_size=$((1*1024*1024)) --local_gpu_id=-1 --start_block_size=$((1*1024*1024)) --duration=30 --max_num_threads=128 --start_num_threads=128 --op_type=write --total_buffer_size=$((4*1024*1024*1024)) --start_batch_size=16 --max_batch_size=16
```
[Screenshot: benchmark throughput results]

I believe there's still significant room for improvement. Have I missed any important configuration?

JasonZhang517 pushed a commit to JasonZhang517/Mooncake that referenced this pull request Feb 12, 2026
…ld fixes (kvcache-ai#1537)

* build: add csignal header to rpc.h

- Include csignal for signal handling in RPC module

Signed-off-by: staryxchen <staryxchen@tencent.com>

* build(benchmark): set RPATH for tebench executable

- Configure tebench target to use relative RPATH for runtime library resolution
- Ensure libasio.so can be found at runtime from the ../lib directory

Signed-off-by: staryxchen <staryxchen@tencent.com>

* feat(benchmark): support graceful interruption during execution

- Update runners to return error when stopped
- Add early exit logic in main loop

Signed-off-by: staryxchen <staryxchen@tencent.com>

* feat(benchmark): support specifying single GPU ID or all GPUs

- Update flags to accept -1 for all GPUs.
- Adjust buffer allocation and address calculation for specific GPU selection.

Signed-off-by: staryxchen <staryxchen@tencent.com>

* refactor(benchmark): unify buffer allocation logic for DRAM and VRAM

- Extract common logic for device prefix and buffer count
- Merge allocation and registration loops to reduce code duplication
- Improve variable const-correctness

Signed-off-by: staryxchen <staryxchen@tencent.com>

* feat(benchmark): add rpc server port configuration

- Add `rpc_server_port` field to `XferBenchConfig`
- Load port value from command line flags

Signed-off-by: staryxchen <staryxchen@tencent.com>

* Update mooncake-transfer-engine/benchmark/tent_backend.cpp

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* refactor(benchmark): simplify gpu id validation and offset calculation

- Replace ternary operators with `std::max` for offset calculation
- Update assertions to check for non-negative GPU IDs

Signed-off-by: staryxchen <staryxchen@tencent.com>

* retrigger CI

Signed-off-by: staryxchen <staryxchen@tencent.com>

---------

Signed-off-by: staryxchen <staryxchen@tencent.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
