[Store] feat: CXL storage full features. by qiuweit7 · Pull Request #1531 · kvcache-ai/Mooncake

qiuweit7 · 2026-02-10T11:50:15Z

Description

This is the second PR for the CXL feature, which mainly completes the CXL storage function (based on the first submitted PR PR #1365). The main changes introduced by this PR:

Added test verification for CXL zero-copy and other interfaces, including zero-copy get_into/put_from operations, batch zero-copy operations, and others.
Added test verification for the CXL memory EVICTION mechanism.
Modified printing of CXL memory usage statistics.

Type of Change

Types
- Bug fix
- New feature
  - Transfer Engine
  - Mooncake Store
  - Mooncake EP
  - Integration
  - P2P Store
  - Python Wheel
- Breaking change
- CI/CD
- Documentation update
- Other

How Has This Been Tested?

Checklist

I have performed a self-review of my own code.
I have formatted my own code using ./scripts/code_format.sh before submitting.
I have updated the documentation.
I have added tests to prove my changes are effective.

gemini-code-assist · 2026-02-10T11:50:38Z

Summary of Changes

Hello @qiuweit7, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the CXL storage capabilities by introducing comprehensive test coverage for its core functionalities. It ensures the reliability of zero-copy data transfers and the memory eviction process, which are critical for efficient CXL memory management. The changes also refine how memory usage is reported, providing clearer insights into system performance.

Highlights

CXL Storage Feature Completion: This pull request completes the CXL storage functionality, building upon a previously submitted PR.
CXL Zero-Copy and Batch Operations Testing: New test verifications have been added for CXL zero-copy interfaces, including get_into and put_from operations, as well as batch zero-copy operations.
CXL Memory Eviction Mechanism Testing: Test verification for the CXL memory eviction mechanism has been introduced, including a new configuration option for eviction high watermark ratio.
CXL Memory Usage Statistics: The logging and printing of CXL memory usage statistics have been modified for better clarity and debugging.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog

extern/pybind11
- Updated the pybind11 submodule to a newer commit.
mooncake-store/include/master_config.h
- Added a new optional configuration field, eviction_high_watermark_ratio, to InProcMasterConfig and its builder, allowing dynamic adjustment of eviction thresholds.
mooncake-store/src/master_service.cpp
- Refactored the Remove and RemoveByRegex function signatures for improved readability.
- Introduced detailed LOG(INFO) messages to indicate when the eviction process is triggered and completed, enhancing debuggability.
- Simplified the VLOG message for evict_objects by removing redundant concatenation.
mooncake-store/src/segment.cpp
- Relocated the call to inc_total_mem_capacity from MountSegment to initializeCxlAllocator to ensure correct metric reporting for CXL allocators.
mooncake-store/tests/cxl_client_integration_test.cpp
- Added a new command-line flag eviction_high_watermark_ratio for configuring eviction behavior in tests.
- Integrated the new eviction_high_watermark_ratio into the InProcMasterConfigBuilder for test setup.
- Removed unused static member variables segment_ptr_ and ram_buffer_size_ to clean up the test class.
- Introduced a new test case EvictOperation to verify the CXL memory eviction mechanism under load.
mooncake-store/tests/test_server_helpers.h
- Modified the InProcMaster class to conditionally use the eviction_high_watermark_ratio from the provided configuration, falling back to a default if not specified.
mooncake-wheel/tests/test_distributed_object_store_cxl.py
- Added a new Python integration test file for CXL features, covering zero-copy operations (put_from, get_into), batch operations, concurrent stress testing, and an end-to-end fuzz test.
scripts/run_tests.sh
- Added a new step to the test script to execute the test_distributed_object_store_cxl.py Python integration test, including starting and stopping the Mooncake Master with CXL enabled.

Activity

No human activity (comments, reviews) was detected on this pull request yet.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This PR completes the CXL storage functionality by adding test verification for zero-copy operations and the eviction mechanism. The changes are generally well-implemented. My review includes two main points: first, the C++ integration test for eviction should be improved to actually verify that keys are evicted. Second, the new Python tests contain commented-out code that should be removed for better clarity.

gemini-code-assist · 2026-02-10T11:54:17Z

mooncake-store/tests/cxl_client_integration_test.cpp

+        }
+    }
+    LOG(INFO) << "Test finished, put_data=" << test_put_size / 1048576 << "MB";
+}


The EvictOperation test is incomplete as it doesn't verify that any keys were actually evicted. This makes it difficult to confirm that the eviction mechanism is working correctly. Adding assertions to check for evicted keys would make this test more robust.

LOG(INFO) << "Test finished, put_data=" << test_put_size / 1048576 << "MB"; // Verify that some keys have been evicted. // It may take some time for eviction to complete. std::this_thread::sleep_for(std::chrono::seconds(5)); bool eviction_happened = false; for (uint64_t i = 0; i < 100; ++i) { std::string key = "evict_key_" + std::to_string(i); auto exist_result = test_client_->IsExist(key); ASSERT_TRUE(exist_result.has_value()); if (!exist_result.value()) { eviction_happened = true; LOG(INFO) << "Key " << key << " was evicted as expected."; break; } } ASSERT_TRUE(eviction_happened) << "No keys were evicted, the eviction mechanism might not be working."; }

gemini-code-assist · 2026-02-10T11:54:17Z

mooncake-wheel/tests/test_distributed_object_store_cxl.py

+        # Register the buffer for zero-copy operations
+        # result = self.store.register_buffer(buffer_ptr, buffer_size)
+        # self.assertEqual(result, 0, "Buffer registration should succeed")


The calls to register_buffer are commented out in this test and other zero-copy tests (test_batch_get_into_operations, test_batch_put_from_operations). If buffer registration is not required for CXL zero-copy operations, these commented lines and their corresponding unregister_buffer calls in cleanup sections should be removed to improve code clarity. If registration is conditionally required, a comment explaining why it's disabled here would be helpful.

codecov-commenter · 2026-02-10T12:21:11Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 96.61017% with 2 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
mooncake-store/include/master_config.h	71.42%	2 Missing ⚠️

📢 Thoughts on this report? Let us know!

merge main: sync latest storage_backend_test fix

stmatengss · 2026-02-12T02:50:42Z

extern/pybind11

Please delete these changes

stmatengss · 2026-02-12T06:01:08Z

Code Review: CXL Storage Full Features

Overview

This PR completes the CXL storage feature implementation with comprehensive testing for zero-copy operations and eviction mechanism. Overall, this is a solid PR with excellent test coverage that significantly enhances CXL support.

✅ Strengths

Comprehensive Test Coverage - 675 lines of Python tests covering zero-copy operations, batch operations, eviction, concurrent stress testing, and fuzz testing
Good Architectural Decisions - Proper separation of CXL protocol from regular memory registration
Improved Observability - Added detailed eviction logging for debugging

⚠️ Issues & Suggestions

1. Configuration Validation (Medium Priority)

mooncake-store/include/master_config.h:64-66

The new eviction_high_watermark_ratio parameter lacks validation. Suggest adding:

InProcMasterConfigBuilder& set_eviction_high_watermark_ratio(double ratio) {
    if (ratio < 0.0 || ratio > 1.0) {
        throw std::invalid_argument("eviction_high_watermark_ratio must be between 0.0 and 1.0");
    }
    eviction_high_watermark_ratio_ = ratio;
    return *this;
}

2. Test Reliability

mooncake-wheel/tests/test_distributed_object_store_cxl.py:245,555

Hard-coded sleep times can cause flaky tests. Consider using polling with timeout:

def wait_for_eviction(store, keys, timeout=10):
    start = time.time()
    while time.time() - start < timeout:
        for key in keys:
            if not store.is_exist(key):
                return True
        time.sleep(0.1)
    return False

3. CXL Simulation File Path

mooncake-wheel/tests/test_distributed_object_store_cxl.py:293

File created in current directory instead of /tmp. Use tempfile module:

import tempfile
CXL_SIM_FILE = os.path.join(tempfile.gettempdir(), "tmp_dax_sim")

4. Test Script Cleanup

scripts/run_tests.sh:969-983

Add trap to ensure master process cleanup even if test fails:

trap "kill $CXL_MASTER_PID 2>/dev/null" EXIT

5. Pybind11 Submodule Update

extern/pybind11:37-38

The submodule update lacks explanation in the PR description. Please clarify why this update is needed.

📋 Checklist Review

✅ Self-review performed
❌ Code formatting not run - Please run ./scripts/code_format.sh
❌ Documentation not updated
✅ Tests added

🎯 Recommendations

Before Merge:

Run code formatter: ./scripts/code_format.sh
Add validation for eviction_high_watermark_ratio
Fix CXL simulation file path to use /tmp
Add trap for cleanup in test script
Document pybind11 submodule update reason

Nice to Have:

Replace sleep-based waits with polling
Add comments for magic error codes (-200, -706)

Verdict

Approve with minor changes ✅

The core implementation is sound with excellent test coverage. Needs formatting fixes and minor improvements to test reliability before merge.

stmatengss · 2026-02-12T15:21:58Z

mooncake-store/src/segment.cpp

-            MasterMetricManager::instance().inc_total_mem_capacity(segment.name,
-                                                                   size);


Why delete these two lines?

stmatengss · 2026-02-12T15:23:00Z

Don't merge it until pybind dependency is resolved.

…vcache-ai#1523) When building with USE_EFA=ON, auto_discover is disabled to prevent RDMA transport installation (QP creation fails on EFA devices). This means TCP transport is also not installed automatically. Add explicit TCP transport installation for non-EFA protocols in the EFA build path. Documentation changes: - build.md: Add USE_EFA option and clarify USE_CUDA default/purpose - supported-protocols.md: Add EFA as a supported protocol - efa_transport.md: Add USE_CUDA=ON to build command, document GPU memory requirement Co-authored-by: whn09 <whn09@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Teng Ma <teng-ma@linux.alibaba.com>

… add delete safety (kvcache-ai#1456)

…ai#1226) * feat(Store): add local hot cache for client * feat(Store): add client local hot cache log to show performance * fix: local hot cache initialize bug * fix(Store): Mooncake put slice is max 16MB， so make local hot cache block 16MB * feat(Store): move local hot cache initialization to Client::Create * feat(Store): local hot cache remove unused small block implementation * feat(Store): add client local hot cache unit test * fix(Store): modify client local hot cache suit with v0.3.7 * feat(Store): change local hot cache unit tes * fix: initialize local hot cache with negative value * feat: use in process master and metadata fro local hot cache unit test. * feat: update local hot cache to one replica one slice version * fix: local hot cache unit test use in process master service * fix: code style fix * fix: fix dirty read when client wants to read a previously hitted hot block but the hot block is modified by incoming put actions * fix: local hot cache unit test use in process master service * fix: code format fix * fix: fix comment problems for * feat: add local hot asynchronous queue size limit * fix: local hot cache task involves the block so that there is no memcpy operation when inserting local hot cache * fix: code check fix * fix: update block in_use prop to reference count --------- Co-authored-by: shichangzhang064 <zhangshichang@h-partners.com>

…to cxl_full

[Store] feat: CXL storage full features.

c34ff64

qiuweit7 requested review from ShangmingCai, XucSh, YiXR, stmatengss and ykwd as code owners February 10, 2026 11:50

github-actions bot added run-ci Store Installation labels Feb 10, 2026

gemini-code-assist bot reviewed Feb 10, 2026

View reviewed changes

qiuweit7 added 2 commits February 11, 2026 14:36

fix(ci): resolve cxl test failure and code format

90cd414

fix(ci): add CXL protocol support and fix code format issues

bd3e31e

qiuweit7 requested a review from Ann-1024 as a code owner February 11, 2026 09:04

Merge branch 'main' into cxl_full

fb22ac1

merge main: sync latest storage_backend_test fix

stmatengss reviewed Feb 12, 2026

View reviewed changes

extern/pybind11

Copy link

Collaborator

stmatengss Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please delete these changes

stmatengss reviewed Feb 12, 2026

View reviewed changes

stmatengss approved these changes Feb 12, 2026

View reviewed changes

qiuweit7 and others added 9 commits February 13, 2026 15:57

[Store] feat: CXL storage full features, reset and rm extern/pybind

1e26461

[Store] Optimize BucketStorageBackend for reduced lock contention and…

a8e4dd8

… add delete safety (kvcache-ai#1456)

fix(ci): resolve cxl test failure and code format

aa6322d

fix(ci): add CXL protocol support and fix code format issues

2fd4384

fix: address comments from code review

2ef2b64

Merge branch 'qiuweit7/cxl_full' of github.com:openanolis/Mooncake in…

802b21d

…to cxl_full

Merge branch 'kvcache-ai:main' into qiuweit7/cxl_full

33e8f6c

qiuweit7 added 2 commits February 24, 2026 10:33

fix(ci): resolve cxl test failure

a3e2006

fix(ci): resolve ci error

1338dda

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

[Store] feat: CXL storage full features.#1531

[Store] feat: CXL storage full features.#1531
qiuweit7 wants to merge 15 commits intokvcache-ai:mainfrom
openanolis:qiuweit7/cxl_full

qiuweit7 commented Feb 10, 2026

Uh oh!

gemini-code-assist bot commented Feb 10, 2026

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Feb 10, 2026

Uh oh!

gemini-code-assist bot Feb 10, 2026

Uh oh!

codecov-commenter commented Feb 10, 2026 •

edited

Loading

Uh oh!

stmatengss Feb 12, 2026

Uh oh!

stmatengss commented Feb 12, 2026

Uh oh!

stmatengss Feb 12, 2026

Uh oh!

stmatengss commented Feb 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

		MasterMetricManager::instance().inc_total_mem_capacity(segment.name,
		size);

Comments

Conversation

qiuweit7 commented Feb 10, 2026

Description

Type of Change

How Has This Been Tested?

Checklist

Uh oh!

gemini-code-assist bot commented Feb 10, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Feb 10, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Feb 10, 2026

Choose a reason for hiding this comment

Uh oh!

codecov-commenter commented Feb 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

stmatengss Feb 12, 2026

Choose a reason for hiding this comment

Uh oh!

stmatengss commented Feb 12, 2026

Code Review: CXL Storage Full Features

Overview

✅ Strengths

⚠️ Issues & Suggestions

1. Configuration Validation (Medium Priority)

2. Test Reliability

3. CXL Simulation File Path

4. Test Script Cleanup

5. Pybind11 Submodule Update

📋 Checklist Review

🎯 Recommendations

Verdict

Uh oh!

stmatengss Feb 12, 2026

Choose a reason for hiding this comment

Uh oh!

stmatengss commented Feb 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

codecov-commenter commented Feb 10, 2026 •

edited

Loading