[TENT] Improve tebench: GPU selection, graceful interruption, and build fixes#1537

Merged
alogfans merged 9 commits into kvcache-ai:main from staryxchen:dev/tebench
Feb 12, 2026
Conversation

@staryxchen
Collaborator

Description

This PR includes several improvements to the tebench benchmark tool:

Features

  • Support specifying a single GPU via --local_gpu_id / --target_gpu_id, or -1 for all GPUs.
  • Add --rpc_server_port flag to configure the RPC server port (0 = auto-select).
  • Support graceful interruption: when a signal is received, the benchmark loop exits cleanly instead of hanging.
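The graceful-interruption behavior described above can be sketched as follows. This is a minimal standalone sketch, not tebench's actual code: `g_stop`, `handleSignal`, and `runLoop` are illustrative names, and the real runners return an error code through their own interfaces.

```cpp
#include <csignal>

// Hypothetical sketch of "runners return an error when stopped":
// a signal handler sets a flag, and the benchmark loop checks it
// each iteration so it can exit cleanly instead of hanging.
static volatile std::sig_atomic_t g_stop = 0;

extern "C" void handleSignal(int) { g_stop = 1; }

// Returns non-zero when interrupted, so the caller can break out of
// the outer benchmark loops early.
int runLoop(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        if (g_stop) return 1;  // early exit on interruption
        // ... one benchmark iteration would run here ...
    }
    return 0;  // completed normally
}
```

The caller installs the handler once (e.g. `std::signal(SIGINT, handleSignal)`) and treats any non-zero return as a request to stop the remaining block-size/thread-count combinations.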

Refactor

  • Unify DRAM and VRAM buffer allocation logic in TENTBenchRunner::allocateBuffers(), eliminating duplicated code across the two paths.

Build

  • Set RPATH ($ORIGIN/../lib) on the tebench executable so it can find libasio.so at runtime.
  • Add missing <csignal> include in rpc.h.
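The RPATH change can be sketched in CMake roughly as follows. This is a hypothetical fragment based on the description above; the exact form in the repository's CMakeLists.txt may differ.

```cmake
# Make the tebench binary look for shared libraries (e.g. libasio.so)
# relative to its own location at runtime, so it works without
# LD_LIBRARY_PATH tweaks. $ORIGIN expands to the executable's directory.
set_target_properties(tebench PROPERTIES
    BUILD_RPATH   "$ORIGIN/../lib"
    INSTALL_RPATH "$ORIGIN/../lib"
)
```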

Type of Change

  • Types
    • Bug fix
    • New feature
      • Transfer Engine
      • Mooncake Store
      • Mooncake EP
      • Integration
      • P2P Store
      • Python Wheel
    • Breaking change
    • CI/CD
    • Documentation update
    • Other

How Has This Been Tested?

Checklist

  • I have performed a self-review of my own code.
  • I have formatted my own code using ./scripts/code_format.sh before submitting.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

- Include csignal for signal handling in RPC module

Signed-off-by: staryxchen <staryxchen@tencent.com>
- Configure tebench target to use relative RPATH for runtime library resolution
- Ensure libasio.so can be found at runtime from the ../lib directory

Signed-off-by: staryxchen <staryxchen@tencent.com>
- Update runners to return error when stopped
- Add early exit logic in main loop

Signed-off-by: staryxchen <staryxchen@tencent.com>
- Update flags to accept -1 for all GPUs.
- Adjust buffer allocation and address calculation for specific GPU selection.

Signed-off-by: staryxchen <staryxchen@tencent.com>
- Extract common logic for device prefix and buffer count
- Merge allocation and registration loops to reduce code duplication
- Improve variable const-correctness

Signed-off-by: staryxchen <staryxchen@tencent.com>
- Add `rpc_server_port` field to `XferBenchConfig`
- Load port value from command line flags

Signed-off-by: staryxchen <staryxchen@tencent.com>
@gemini-code-assist
Contributor

Summary of Changes

Hello @staryxchen, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the tebench benchmark tool by introducing greater flexibility in GPU selection and RPC server configuration, improving its robustness with graceful interruption handling, and streamlining its internal buffer allocation mechanism. These changes aim to make the benchmark more user-friendly, reliable, and maintainable, while also addressing build-related issues for better deployment.

Highlights

  • Enhanced GPU Selection: Added options to specify a single GPU via --local_gpu_id and --target_gpu_id, or to use all available GPUs by setting the ID to -1.
  • Configurable RPC Server Port: Introduced a new flag, --rpc_server_port, allowing users to specify the RPC server port, with 0 enabling auto-selection.
  • Graceful Interruption Handling: Implemented logic to ensure the benchmark loop exits cleanly upon receiving a signal, preventing the benchmark from hanging.
  • Unified Buffer Allocation Logic: Refactored TENTBenchRunner::allocateBuffers() to consolidate DRAM and VRAM buffer allocation into a single, unified code path, reducing duplication.
  • Build System Improvements: Configured RPATH for the tebench executable to correctly locate libasio.so at runtime and added a missing <csignal> include in rpc.h.


Changelog
  • mooncake-transfer-engine/benchmark/CMakeLists.txt
    • Corrected if statement formatting for CUDAToolkit_FOUND.
    • Added RPATH configuration for the tebench executable to ensure libasio.so is found at runtime.
  • mooncake-transfer-engine/benchmark/main.cpp
    • Modified processBatchSizes function to return an int for interruption status.
    • Adjusted GPU ID calculation to support -1 for all GPUs and respect local_gpu_id/target_gpu_id offsets.
    • Implemented interruption checks in the main benchmark loops to allow graceful exit.
  • mooncake-transfer-engine/benchmark/te_backend.cpp
    • Updated VRAM buffer allocation logic to respect local_gpu_id for single GPU selection.
    • Modified runInitiatorTasks to return an integer indicating if an interruption occurred.
  • mooncake-transfer-engine/benchmark/tent_backend.cpp
    • Integrated rpc_server_port into the configuration loading.
    • Refactored allocateBuffers to unify DRAM and VRAM buffer allocation logic, reducing redundancy.
    • Modified runInitiatorTasks to return an integer indicating if an interruption occurred.
  • mooncake-transfer-engine/benchmark/utils.cpp
    • Updated descriptions for local_gpu_id and target_gpu_id flags to mention -1 for all GPUs.
    • Added a new rpc_server_port flag and its corresponding static member and loading logic.
  • mooncake-transfer-engine/benchmark/utils.h
    • Declared rpc_server_port as a static member in XferBenchConfig.
  • mooncake-transfer-engine/tent/include/tent/rpc/rpc.h
    • Included <csignal> header for signal handling.
Activity
  • No specific activity (comments, reviews, progress updates) has been provided for this pull request.

@gemini-code-assist gemini-code-assist bot left a comment
Code Review

This pull request introduces several valuable improvements to the tebench tool, including GPU selection, graceful interruption handling, and build fixes. The refactoring to unify buffer allocation logic in TENTBenchRunner::allocateBuffers is a notable improvement to code quality, reducing duplication. However, the new GPU selection logic lacks sufficient validation for negative ID values (other than the special -1 value), which can lead to out-of-bounds memory access and potential crashes or undefined behavior. It is recommended to validate that local_gpu_id and target_gpu_id are either -1 or non-negative before they are used as offsets or device indices. Additionally, there is a minor suggestion regarding error handling to improve robustness.

Comment on lines 47 to 53:

```cpp
auto target_gpu_offset = XferBenchConfig::target_gpu_id == -1
                             ? 0
                             : XferBenchConfig::target_gpu_id;
uint64_t local_addr = runner.getLocalBufferBase(
    local_gpu_offset + thread_id, max_block_size, max_batch_size);
uint64_t target_addr = runner.getTargetBufferBase(
    target_gpu_offset + thread_id, max_block_size, max_batch_size);
```
security-high

The logic for target_gpu_offset allows XferBenchConfig::target_gpu_id to be a negative value other than -1. When such a value is provided, target_gpu_offset + thread_id can be negative. This value is passed to runner.getTargetBufferBase, which in both TEBenchRunner and TENTBenchRunner implementations uses it as an index into an array after a modulo operation. In C++, the result of the modulo operator % with a negative operand is negative, resulting in an out-of-bounds array access to info_->buffers or pinned_buffer_list_.
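The sign behavior the reviewer describes is easy to reproduce: in C++, the result of `a % b` takes the sign of `a`, so a negative offset survives the modulo and becomes a negative array index. A minimal illustration (`bufferIndex` is a hypothetical stand-in for the indexing done in the runners, not an actual tebench function):

```cpp
// Demonstrates the hazard: with a negative gpu_offset (e.g. a user
// passing --target_gpu_id=-2), the modulo yields a negative index,
// which would be out of bounds for info_->buffers or
// pinned_buffer_list_.
int bufferIndex(int gpu_offset, int thread_id, int buffer_count) {
    return (gpu_offset + thread_id) % buffer_count;
}
```

For example, `bufferIndex(-2, 0, 4)` evaluates to `-2`, not `2`, which is why the IDs must be validated before use.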

Comment on lines 101 to 107:

```cpp
if (XferBenchConfig::local_gpu_id != -1) {
    start_idx = XferBenchConfig::local_gpu_id;
    num_buffers = 1;
    LOG_ASSERT(start_idx < gpu_count)
        << "local_gpu_id " << start_idx << " exceeds device count "
        << gpu_count;
}
```
security-medium

Similar to the issue in te_backend.cpp, local_gpu_id is not validated to be non-negative. While TENTBenchRunner uses CHECK_FAIL to handle allocation errors, providing a negative GPU ID will still cause the benchmark to fail and exit unexpectedly due to an invalid device location string (e.g., 'cuda:-2') being passed to the transfer engine.
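A defensive version along the lines both comments suggest might look like this. This is a hypothetical sketch, not tebench's actual code: `validatedGpuOffset` is an illustrative helper, and the clamp with `std::max` mirrors the approach adopted in the follow-up commit.

```cpp
#include <algorithm>
#include <stdexcept>
#include <string>

// Accept -1 ("all GPUs") or a non-negative id; reject anything else.
// -1 maps to offset 0, so the caller never sees a negative index.
int validatedGpuOffset(int gpu_id) {
    if (gpu_id < -1) {
        throw std::invalid_argument("gpu_id must be -1 or >= 0, got " +
                                    std::to_string(gpu_id));
    }
    return std::max(0, gpu_id);
}
```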

@staryxchen
Collaborator Author

I am testing the performance of TENT, and the current test results cannot reach the theoretical bandwidth limit. The original TE was able to achieve this, see #675
@alogfans Do you have any idea?

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
- Replace ternary operators with `std::max` for offset calculation
- Update assertions to check for non-negative GPU IDs

Signed-off-by: staryxchen <staryxchen@tencent.com>
Signed-off-by: staryxchen <staryxchen@tencent.com>
@codecov-commenter

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.


Collaborator

@alogfans alogfans left a comment

LGTM

@alogfans alogfans merged commit 1eec5a5 into kvcache-ai:main Feb 12, 2026
16 checks passed
@alogfans
Collaborator

@staryxchen The original test benchmark uses GPU VRAM (#675), while the tebench may not use it by default.

@staryxchen
Collaborator Author

staryxchen commented Feb 12, 2026

@staryxchen The original test benchmark uses GPU VRAM (#675), while the tebench may not use it by default.

@alogfans I specified seg_type as VRAM; the specific commands used and the results are as follows:

```shell
# target
./tebench --backend=tent --seg_name=$IP --seg_type=VRAM --max_block_size=$((1*1024*1024)) --local_gpu_id=-1 --start_block_size=$((1*1024*1024)) --max_num_threads=128 --start_num_threads=128 --total_buffer_size=$((4*1024*1024*1024)) --start_batch_size=16 --max_batch_size=16 --rpc_server_port=16568
# initiator
./tebench --backend=tent --target_seg_name=$IP:16568 --seg_type=VRAM --max_block_size=$((1*1024*1024)) --local_gpu_id=-1 --start_block_size=$((1*1024*1024)) --duration=30 --max_num_threads=128 --start_num_threads=128 --op_type=write --total_buffer_size=$((4*1024*1024*1024)) --start_batch_size=16 --max_batch_size=16
```
[Screenshot: benchmark throughput results]

I believe there's still significant room for improvement. Have I missed any important configuration?

JasonZhang517 pushed a commit to JasonZhang517/Mooncake that referenced this pull request Feb 12, 2026
…ld fixes (kvcache-ai#1537)

* build: add csignal header to rpc.h

- Include csignal for signal handling in RPC module

Signed-off-by: staryxchen <staryxchen@tencent.com>

* build(benchmark): set RPATH for tebench executable

- Configure tebench target to use relative RPATH for runtime library resolution
- Ensure libasio.so can be found at runtime from the ../lib directory

Signed-off-by: staryxchen <staryxchen@tencent.com>

* feat(benchmark): support graceful interruption during execution

- Update runners to return error when stopped
- Add early exit logic in main loop

Signed-off-by: staryxchen <staryxchen@tencent.com>

* feat(benchmark): support specifying single GPU ID or all GPUs

- Update flags to accept -1 for all GPUs.
- Adjust buffer allocation and address calculation for specific GPU selection.

Signed-off-by: staryxchen <staryxchen@tencent.com>

* refactor(benchmark): unify buffer allocation logic for DRAM and VRAM

- Extract common logic for device prefix and buffer count
- Merge allocation and registration loops to reduce code duplication
- Improve variable const-correctness

Signed-off-by: staryxchen <staryxchen@tencent.com>

* feat(benchmark): add rpc server port configuration

- Add `rpc_server_port` field to `XferBenchConfig`
- Load port value from command line flags

Signed-off-by: staryxchen <staryxchen@tencent.com>

* Update mooncake-transfer-engine/benchmark/tent_backend.cpp

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* refactor(benchmark): simplify gpu id validation and offset calculation

- Replace ternary operators with `std::max` for offset calculation
- Update assertions to check for non-negative GPU IDs

Signed-off-by: staryxchen <staryxchen@tencent.com>

* retrigger CI

Signed-off-by: staryxchen <staryxchen@tencent.com>

---------

Signed-off-by: staryxchen <staryxchen@tencent.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
