Skip to content

Comments

[Store] feat: CXL storage full features.#1531

Open
qiuweit7 wants to merge 15 commits intokvcache-ai:mainfrom
openanolis:qiuweit7/cxl_full
Open

[Store] feat: CXL storage full features.#1531
qiuweit7 wants to merge 15 commits intokvcache-ai:mainfrom
openanolis:qiuweit7/cxl_full

Conversation

@qiuweit7
Copy link
Contributor

Description

This is the second PR for the CXL feature, which mainly completes the CXL storage function (based on the first submitted PR PR #1365). The main changes introduced by this PR:

  1. Added test verification for CXL zero-copy and other interfaces, including zero-copy get_into/put_from operations, batch zero-copy operations, and others.
  2. Added test verification for the CXL memory EVICTION mechanism.
  3. Modified printing of CXL memory usage statistics.

Type of Change

  • Types
    • Bug fix
    • New feature
      • Transfer Engine
      • Mooncake Store
      • Mooncake EP
      • Integration
      • P2P Store
      • Python Wheel
    • Breaking change
    • CI/CD
    • Documentation update
    • Other

How Has This Been Tested?

Checklist

  • I have performed a self-review of my own code.
  • I have formatted my own code using ./scripts/code_format.sh before submitting.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @qiuweit7, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the CXL storage capabilities by introducing comprehensive test coverage for its core functionalities. It ensures the reliability of zero-copy data transfers and the memory eviction process, which are critical for efficient CXL memory management. The changes also refine how memory usage is reported, providing clearer insights into system performance.

Highlights

  • CXL Storage Feature Completion: This pull request completes the CXL storage functionality, building upon a previously submitted PR.
  • CXL Zero-Copy and Batch Operations Testing: New test verifications have been added for CXL zero-copy interfaces, including get_into and put_from operations, as well as batch zero-copy operations.
  • CXL Memory Eviction Mechanism Testing: Test verification for the CXL memory eviction mechanism has been introduced, including a new configuration option for eviction high watermark ratio.
  • CXL Memory Usage Statistics: The logging and printing of CXL memory usage statistics have been modified for better clarity and debugging.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • extern/pybind11
    • Updated the pybind11 submodule to a newer commit.
  • mooncake-store/include/master_config.h
    • Added a new optional configuration field, eviction_high_watermark_ratio, to InProcMasterConfig and its builder, allowing dynamic adjustment of eviction thresholds.
  • mooncake-store/src/master_service.cpp
    • Refactored the Remove and RemoveByRegex function signatures for improved readability.
    • Introduced detailed LOG(INFO) messages to indicate when the eviction process is triggered and completed, enhancing debuggability.
    • Simplified the VLOG message for evict_objects by removing redundant concatenation.
  • mooncake-store/src/segment.cpp
    • Relocated the call to inc_total_mem_capacity from MountSegment to initializeCxlAllocator to ensure correct metric reporting for CXL allocators.
  • mooncake-store/tests/cxl_client_integration_test.cpp
    • Added a new command-line flag eviction_high_watermark_ratio for configuring eviction behavior in tests.
    • Integrated the new eviction_high_watermark_ratio into the InProcMasterConfigBuilder for test setup.
    • Removed unused static member variables segment_ptr_ and ram_buffer_size_ to clean up the test class.
    • Introduced a new test case EvictOperation to verify the CXL memory eviction mechanism under load.
  • mooncake-store/tests/test_server_helpers.h
    • Modified the InProcMaster class to conditionally use the eviction_high_watermark_ratio from the provided configuration, falling back to a default if not specified.
  • mooncake-wheel/tests/test_distributed_object_store_cxl.py
    • Added a new Python integration test file for CXL features, covering zero-copy operations (put_from, get_into), batch operations, concurrent stress testing, and an end-to-end fuzz test.
  • scripts/run_tests.sh
    • Added a new step to the test script to execute the test_distributed_object_store_cxl.py Python integration test, including starting and stopping the Mooncake Master with CXL enabled.
Activity
  • No human activity (comments, reviews) was detected on this pull request yet.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This PR completes the CXL storage functionality by adding test verification for zero-copy operations and the eviction mechanism. The changes are generally well-implemented. My review includes two main points: first, the C++ integration test for eviction should be improved to actually verify that keys are evicted. Second, the new Python tests contain commented-out code that should be removed for better clarity.

}
}
LOG(INFO) << "Test finished, put_data=" << test_put_size / 1048576 << "MB";
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The EvictOperation test is incomplete as it doesn't verify that any keys were actually evicted. This makes it difficult to confirm that the eviction mechanism is working correctly. Adding assertions to check for evicted keys would make this test more robust.

    LOG(INFO) << "Test finished, put_data=" << test_put_size / 1048576 << "MB";

    // Verify that some keys have been evicted.
    // It may take some time for eviction to complete.
    std::this_thread::sleep_for(std::chrono::seconds(5));

    bool eviction_happened = false;
    for (uint64_t i = 0; i < 100; ++i) {
        std::string key = "evict_key_" + std::to_string(i);
        auto exist_result = test_client_->IsExist(key);
        ASSERT_TRUE(exist_result.has_value());
        if (!exist_result.value()) {
            eviction_happened = true;
            LOG(INFO) << "Key " << key << " was evicted as expected.";
            break;
        }
    }
    ASSERT_TRUE(eviction_happened)
        << "No keys were evicted, the eviction mechanism might not be working.";
}

Comment on lines 236 to 238
# Register the buffer for zero-copy operations
# result = self.store.register_buffer(buffer_ptr, buffer_size)
# self.assertEqual(result, 0, "Buffer registration should succeed")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The calls to register_buffer are commented out in this test and other zero-copy tests (test_batch_get_into_operations, test_batch_put_from_operations). If buffer registration is not required for CXL zero-copy operations, these commented lines and their corresponding unregister_buffer calls in cleanup sections should be removed to improve code clarity. If registration is conditionally required, a comment explaining why it's disabled here would be helpful.

@codecov-commenter
Copy link

codecov-commenter commented Feb 10, 2026

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 96.61017% with 2 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
mooncake-store/include/master_config.h 71.42% 2 Missing ⚠️

📢 Thoughts on this report? Let us know!

@qiuweit7 qiuweit7 requested a review from Ann-1024 as a code owner February 11, 2026 09:04
merge main: sync latest storage_backend_test fix
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please delete these changes

@stmatengss
Copy link
Collaborator

Code Review: CXL Storage Full Features

Overview

This PR completes the CXL storage feature implementation with comprehensive testing for zero-copy operations and eviction mechanism. Overall, this is a solid PR with excellent test coverage that significantly enhances CXL support.

✅ Strengths

  1. Comprehensive Test Coverage - 675 lines of Python tests covering zero-copy operations, batch operations, eviction, concurrent stress testing, and fuzz testing
  2. Good Architectural Decisions - Proper separation of CXL protocol from regular memory registration
  3. Improved Observability - Added detailed eviction logging for debugging

⚠️ Issues & Suggestions

1. Configuration Validation (Medium Priority)

mooncake-store/include/master_config.h:64-66

The new eviction_high_watermark_ratio parameter lacks validation. Suggest adding:

InProcMasterConfigBuilder& set_eviction_high_watermark_ratio(double ratio) {
    if (ratio < 0.0 || ratio > 1.0) {
        throw std::invalid_argument("eviction_high_watermark_ratio must be between 0.0 and 1.0");
    }
    eviction_high_watermark_ratio_ = ratio;
    return *this;
}

2. Test Reliability

mooncake-wheel/tests/test_distributed_object_store_cxl.py:245,555

Hard-coded sleep times can cause flaky tests. Consider using polling with timeout:

def wait_for_eviction(store, keys, timeout=10):
    start = time.time()
    while time.time() - start < timeout:
        for key in keys:
            if not store.is_exist(key):
                return True
        time.sleep(0.1)
    return False

3. CXL Simulation File Path

mooncake-wheel/tests/test_distributed_object_store_cxl.py:293

File created in current directory instead of /tmp. Use tempfile module:

import tempfile
CXL_SIM_FILE = os.path.join(tempfile.gettempdir(), "tmp_dax_sim")

4. Test Script Cleanup

scripts/run_tests.sh:969-983

Add trap to ensure master process cleanup even if test fails:

trap "kill $CXL_MASTER_PID 2>/dev/null" EXIT

5. Pybind11 Submodule Update

extern/pybind11:37-38

The submodule update lacks explanation in the PR description. Please clarify why this update is needed.

📋 Checklist Review

  • ✅ Self-review performed
  • Code formatting not run - Please run ./scripts/code_format.sh
  • ❌ Documentation not updated
  • ✅ Tests added

🎯 Recommendations

Before Merge:

  1. Run code formatter: ./scripts/code_format.sh
  2. Add validation for eviction_high_watermark_ratio
  3. Fix CXL simulation file path to use /tmp
  4. Add trap for cleanup in test script
  5. Document pybind11 submodule update reason

Nice to Have:

  • Replace sleep-based waits with polling
  • Add comments for magic error codes (-200, -706)

Verdict

Approve with minor changes

The core implementation is sound with excellent test coverage. Needs formatting fixes and minor improvements to test reliability before merge.

Comment on lines -28 to -29
MasterMetricManager::instance().inc_total_mem_capacity(segment.name,
size);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why delete these two lines?

@stmatengss
Copy link
Collaborator

Don't merge it until pybind dependency is resolved.

qiuweit7 and others added 9 commits February 13, 2026 15:57
…vcache-ai#1523)

When building with USE_EFA=ON, auto_discover is disabled to prevent
RDMA transport installation (QP creation fails on EFA devices). This
means TCP transport is also not installed automatically. Add explicit
TCP transport installation for non-EFA protocols in the EFA build path.

Documentation changes:
- build.md: Add USE_EFA option and clarify USE_CUDA default/purpose
- supported-protocols.md: Add EFA as a supported protocol
- efa_transport.md: Add USE_CUDA=ON to build command, document GPU
  memory requirement

Co-authored-by: whn09 <whn09@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Teng Ma <teng-ma@linux.alibaba.com>
…ai#1226)

* feat(Store): add local hot cache for client

* feat(Store): add client local hot cache log to show performance

* fix: local hot cache initialize bug

* fix(Store): Mooncake put slice is max 16MB, so make local hot cache block 16MB

* feat(Store): move local hot cache initialization to Client::Create

* feat(Store):  local hot cache remove unused small block implementation

* feat(Store): add client local hot cache unit test

* fix(Store): modify client local hot cache suit with v0.3.7

* feat(Store): change local hot cache unit tes

* fix: initialize local hot cache with negative value

* feat: use in process master and metadata fro local hot cache unit test.

* feat: update local hot cache to one replica one slice version

* fix: local hot cache unit test use in process master service

* fix: code style fix

* fix: fix dirty read when client wants to read a previously hitted hot block but the hot block is modified by incoming put actions

* fix: local hot cache unit test use in process master service

* fix: code format fix

* fix: fix comment problems for

* feat: add local hot asynchronous queue size limit

* fix: local hot cache task involves the block so that there is no memcpy operation when inserting local hot cache

* fix: code check fix

* fix: update block in_use prop to reference count

---------

Co-authored-by: shichangzhang064 <zhangshichang@h-partners.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants