Add Local Cache Mechanism for Mooncake Store Client#1226

Merged
stmatengss merged 22 commits into kvcache-ai:main from Shichang-Zhang:upstream-zsc-working
Feb 11, 2026

Conversation


@Shichang-Zhang Shichang-Zhang commented Dec 17, 2025

Description

This PR introduces the client hot cache feature mentioned in Issue 1062.

This feature is enabled by setting the environment variable LOCAL_HOT_CACHE_SIZE (in bytes). The Mooncake Store client allocates memory blocks for caching hot data on a per-request basis. The block size is configurable via the environment variable LOCAL_HOT_BLOCK_SIZE (in bytes) and defaults to 16MB; memory is allocated in units of the configured block size. If LOCAL_HOT_CACHE_SIZE is set to a value smaller than the block size (either the default 16MB or the value specified by LOCAL_HOT_BLOCK_SIZE), to zero or a negative number, or to an otherwise invalid input (such as a non-numeric string), the client hot cache feature is disabled.
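The sizing rules above (block-aligned allocation, disabled on invalid input) can be sketched as follows. This is a minimal illustration; names such as `ParseHotCacheSize` are invented here and are not the PR's actual code:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

constexpr uint64_t kDefaultBlockSize = 16ull * 1024 * 1024;  // 16MB

// Returns the usable cache size in bytes (rounded down to whole blocks),
// or 0 if the feature should be disabled. Illustrative sketch only.
uint64_t ParseHotCacheSize(const char* size_env, const char* block_env) {
    uint64_t block_size = kDefaultBlockSize;
    if (block_env != nullptr) {
        try {
            block_size = std::stoull(block_env);
        } catch (const std::exception&) {
            block_size = kDefaultBlockSize;  // fall back on bad block config
        }
    }
    if (size_env == nullptr) return 0;  // feature is off by default
    std::string s(size_env);
    // std::stoull accepts a leading '-' and wraps around instead of
    // throwing, so negative inputs must be rejected explicitly.
    if (s.empty() || s[0] == '-') return 0;
    uint64_t cache_size = 0;
    try {
        cache_size = std::stoull(s);
    } catch (const std::exception&) {
        return 0;  // non-numeric input disables the feature
    }
    if (cache_size < block_size) return 0;  // smaller than one block
    return (cache_size / block_size) * block_size;  // whole blocks only
}
```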

For the standalone client scenario, the feature should be disabled, since all data already resides in local storage.

For architectures with multiple clients and a master, the feature is recommended. Users should tune LOCAL_HOT_CACHE_SIZE for their actual deployment to strike an optimal balance, so that the performance gain from reduced cross-node data transfer outweighs the overhead of the additional local cache memory.

Type of Change

  • Types
    • Bug fix
    • New feature
      • Transfer Engine
      • [x] Mooncake Store
      • Mooncake EP
      • Integration
      • P2P Store
      • Python Wheel
    • Breaking change
    • CI/CD
    • Documentation update
    • Other

How Has This Been Tested?

Run the unit test for this feature: mooncake-store/tests/client_local_hot_cache_test.cpp.

Performance Test

Software

| Name | Version |
|---|---|
| vLLM | v0.9.1 |
| vLLM Ascend | v0.9.1 |
| Mooncake | v0.3.7 |
| EvalScope | v1.2.0 |
| CANN | 8.2RC1 |
| Kubernetes | v1.33.1 |

Hardware

| Name | Description |
|---|---|
| Machine | Atlas 800I A2 |
| NPU | 8 * Ascend 910B4 |
| CPU | 4 * Kunpeng 920 |
| Memory | 32G * 32 |
| Network | 8 * 200GE QSFP |

Mooncake Performance Test

Environment

We run the test on a Kubernetes cluster that involves two Atlas 800I A2 machines.

Method

  1. Start master-service pod as Mooncake Master and redis pod as the metadata server.

  2. Start a Mooncake Client pod on node 1.

  3. Call MooncakeDistributedStore:Put interface to store data (e.g., 1GB) on Mooncake Client 1.

    3.1 Store 10 keys with corresponding data (e.g., 1GB total) in Mooncake.

    3.2 All data slices are stored on node 1 since only one client exists.

  4. Start another Mooncake Client pod with local hot cache (e.g., 1GB) enabled on node 2.

  5. Call MooncakeDistributedStore:Get/BatchGet interface from Mooncake Client 2 to retrieve the stored data.

    5.1 All data slices are on node 1, so all slices requested by Mooncake Client 2 will be transferred remotely.

    5.2 Use TCP protocol to increase network transmission latency.

    5.3 After retrieving data from remote node 1, the slices are stored in the local hot cache.

  6. Record the latency of the Get/BatchGet calls.

  7. Repeatedly call MooncakeDistributedStore:Get/BatchGet interface from Mooncake Client 2 and record the latency.
    7.1 These requests result in cache hits.

Result

| Method | Total Key Num | Data Size (MB) | Total Data Size (MB) | Local Cache Size (MB) | Get/BatchGet Per Round | Batch Size | Test Rounds | First Round Latency (ms) | Average Round Latency (ms) | Latency Delta (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| get | 25 | 10 | 250 | 0 | 25 | - | 10 | 767.64 | 881.02 | - |
| get | 25 | 10 | 250 | 400 | 25 | - | 10 | 1204.43 | 287.38 | 76.14 |
| get | 25 | 25 | 625 | 0 | 25 | - | 10 | 757.65 | 760.36 | - |
| get | 25 | 25 | 625 | 800 | 25 | - | 10 | 905.76 | 259.48 | 71.35 |
| batchGet | 25 | 10 | 250 | 0 | 5 | 5 | 10 | 489.24 | 463.28 | - |
| batchGet | 25 | 10 | 250 | 400 | 5 | 5 | 10 | 425.94 | 151.23 | 64.49 |
| batchGet | 25 | 25 | 625 | 0 | 5 | 5 | 10 | 433.15 | 440.47 | - |
| batchGet | 25 | 25 | 625 | 800 | 5 | 5 | 10 | 504.91 | 143.83 | 71.51 |

Note: Latency is measured per key request; for the BatchGet interface, the first-round latency is the average latency of the first 2 keys.
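The Latency Delta column matches the percentage reduction of the average-round latency relative to the first-round (cold, cache-populating) latency. A small helper, illustrative and not from the PR, reproduces the reported values:

```cpp
#include <cassert>
#include <cmath>

// Percentage reduction of the warm average-round latency relative to the
// first (cold) round, which is when the local hot cache gets populated.
double LatencyDelta(double first_round_ms, double avg_round_ms) {
    return (first_round_ms - avg_round_ms) / first_round_ms * 100.0;
}
```

For example, the second `get` row gives (1204.43 - 287.38) / 1204.43 * 100 ≈ 76.14, matching the table.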

End to End Inference Performance Test

Environment

We run the test on a Kubernetes cluster that involves three Atlas 800I A2 machines. We deployed the inference service with vLLM v0 and Mooncake.

Method

  1. Start master-service pod as Mooncake Master and redis pod as the metadata server.

  2. Start 5 prefill pods and 5 decode pods running vLLM and Mooncake Client. Each client offers 30GB store memory.

    2.1 Deploy 10 Mooncake Clients to increase the probability of remote slice data transmission.

    2.2 Use TCP protocol and deploy across three machines to increase network transmission latency.

    2.3 Prefill instances do not enable the local hot cache feature, while decode instances enable it.

  3. Start a proxy server pod for routing inference requests.

    3.1 Use the example proxy server from vLLM.

  4. Start EvalScope script

    4.1 Send 1000 inference requests at concurrency 5 or 10, with a fraction of exact duplicate requests to simulate hotspot/cache hits. Average input length is 4096 tokens and output length is fixed at 256 tokens.

    4.2 Use a random dataset to vary prefix lengths, controlling prefix sharing across requests and thus the number of hot slices (and hit rate).

    4.3 EvalScope summarizes and logs key performance metrics such as throughput, latency (p50/p95/p99), and success/error rates.

Test Script

  • vLLM prefill
export MOONCAKE_CONFIG_PATH=/app/mooncake_config.json
echo "{
    \"local_hostname\": \"$POD_IP\",
    \"metadata_server\": \"redis://redis:6379\",
    \"master_server_address\": \"mooncake-master:30089\",
    \"protocol\": \"tcp\",
    \"device_name\": \"\",
    \"global_segment_size\": 32212254720
}" > ${MOONCAKE_CONFIG_PATH}

VLLM_USE_V1=0 python3 -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-8B \
  --port 8100 \
  --max-model-len 10000 \
  --gpu-memory-utilization 0.8 \
  --block-size 128 \
  --kv-transfer-config '{"kv_connector":"MooncakeStoreConnector","kv_role":"kv_producer","kv_buffer_device":"npu"}'

Note: This is the start command of the prefill pod. So POD_IP is the node address of this pod.

  • vLLM decode
export MOONCAKE_CONFIG_PATH=/app/mooncake_config.json
echo "{
    \"local_hostname\": \"$POD_IP\",
    \"metadata_server\": \"redis://redis:6379\",
    \"master_server_address\": \"mooncake-master:30089\",
    \"protocol\": \"tcp\",
    \"device_name\": \"\",
    \"global_segment_size\": 32212254720
}" > ${MOONCAKE_CONFIG_PATH}

VLLM_USE_V1=0 python3 -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-8B \
  --port 8200 \
  --max-model-len 10000 \
  --gpu-memory-utilization 0.8 \
  --kv-transfer-config '{"kv_connector":"MooncakeStoreConnector","kv_role":"kv_consumer","kv_buffer_device":"npu"}'

Note: This is the start command of the decode pod. So POD_IP is the node address of this pod.

  • EvalScope DataSet
import json
import random
import os
import numpy as np
from datetime import datetime
from typing import List, Dict
from modelscope import AutoTokenizer


# --- Core configuration ---
MODEL_NAME = "Qwen/Qwen3-8B"
REQUEST_LEN = 4096  # Target token length for each request
DATASET_SIZE = 1000  # Total number of requests
REPEAT_PERCENT = 0.5  # Ratio of identical (repeat) requests (0.0-1.0)

# Fixed seeds for reproducibility
RANDOM_SEED = 42

class IdenticalDatasetGenerator:
    """Dataset generator for identical/repeated requests."""
    
    def __init__(self, model_name: str = MODEL_NAME):
        """Initialize the generator and load the tokenizer."""
        self.model_name = model_name
        
        print(f"Loading tokenizer from ModelScope: {model_name} ...")
        self.tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        
        print(f"Tokenizer loaded. Vocab size: {self.tokenizer.vocab_size}")
        
    def generate_random_tokens(self, length: int) -> str:
        """
        Generate random text that encodes to the target number of tokens.
        A decode->encode roundtrip is used to approximate exact token count.
        
        Args:
            length: Target number of tokens.
            
        Returns:
            A text string (ideally encoding to exactly `length` tokens).
        """
        if length <= 0:
            return ""
        
        # Sample token IDs, decode to text, then re-encode to validate the count.
        token_ids = np.random.choice(
            self.tokenizer.vocab_size,
            size=length,
            replace=True
        ).tolist()
        
        text = self.tokenizer.decode(token_ids, skip_special_tokens=False)
        
        encoded_ids = self.tokenizer.encode(text, add_special_tokens=False)
        actual_length = len(encoded_ids)
        
        # If the roundtrip shrinks the token count, append additional tokens.
        if actual_length < length:
            needed = length - actual_length
            additional_ids = np.random.choice(
                self.tokenizer.vocab_size,
                size=needed,
                replace=True
            ).tolist()
            additional_text = self.tokenizer.decode(additional_ids, skip_special_tokens=False)
            text = text + additional_text
        
        return text
    
    def generate_request(self) -> str:
        """Generate a single request with REQUEST_LEN tokens."""
        return self.generate_random_tokens(REQUEST_LEN)
    
    def generate_dataset(self) -> List[Dict]:
        """
        Generate the full dataset.
        
        Returns:
            A list of items with `prompt`, plus metadata fields.
        """
        repeat_count = int(DATASET_SIZE * REPEAT_PERCENT)
        unique_count = DATASET_SIZE - repeat_count
        
        print("Generating identical-request dataset...")
        print(
            f"Config: total={DATASET_SIZE}, unique={unique_count}, identical={repeat_count} "
            f"({REPEAT_PERCENT:.1%}), tokens_per_request={REQUEST_LEN}"
        )
        
        dataset = []
        
        identical_request = self.generate_request()
        
        print(f"Generating {unique_count} unique requests...")
        unique_requests = []
        for i in range(unique_count):
            if (i + 1) % 100 == 0:
                print(f"Progress: {i+1}/{unique_count} unique requests generated")
            unique_request = self.generate_request()
            unique_requests.append(unique_request)
        
        print("Building dataset...")
        
        for i, request in enumerate(unique_requests):
            dataset.append({
                "prompt": request,
                "is_identical": False,
                "request_id": i,
            })
        
        for i in range(repeat_count):
            dataset.append({
                "prompt": identical_request,
                "is_identical": True,
                "request_id": unique_count + i,
            })
        
        print("Shuffling dataset...")
        random.shuffle(dataset)
        
        # Reassign request_id after shuffling.
        for i, item in enumerate(dataset):
            item["request_id"] = i
        
        print("Dataset generation complete.")
        return dataset
    
    def save_dataset(self, dataset: List[Dict], output_dir: str = "."):
        """
        Save the dataset to a JSON file (timestamped for uniqueness).
        
        Args:
            dataset: Dataset items.
            output_dir: Output directory.
        """
        os.makedirs(output_dir, exist_ok=True)
        
        identical_count = sum(1 for item in dataset if item.get("is_identical", False))
        
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        filename = f"identical_data_{DATASET_SIZE}total_{REQUEST_LEN}len_{REPEAT_PERCENT*100:.0f}pct_repeat_{timestamp}.json"
        filepath = os.path.join(output_dir, filename)
        
        print(f"Saving dataset to: {filepath}")
        
        with open(filepath, 'w', encoding='utf-8') as f:
            json.dump(dataset, f, ensure_ascii=False, indent=2)
        
        file_size_mb = os.path.getsize(filepath) / 1024 / 1024
        print(
            f"Saved. size={file_size_mb:.2f} MB, total={len(dataset)}, "
            f"identical={identical_count}, unique={len(dataset) - identical_count}"
        )
        return filepath


def main():
    random.seed(RANDOM_SEED)
    np.random.seed(RANDOM_SEED)
    
    generator = IdenticalDatasetGenerator(model_name=MODEL_NAME)
    dataset = generator.generate_dataset()
    generator.save_dataset(dataset, output_dir="./identical_dataset")


if __name__ == "__main__":
    main()

Result

| Concurrency | Identical Request Ratio (%) | Local Cache Size (GB) | Throughput (tkn/s) | TTFT (s) | TPOT (s) | Throughput Delta | TTFT Delta | TPOT Delta |
|---|---|---|---|---|---|---|---|---|
| 5 | 50 | 0 | 577.89 | 26.18 | 0.0520 | 0.00% | 0.00% | 0.00% |
| 5 | 50 | 1 | 745.41 | 17.02 | 0.0531 | 28.99% | -35.00% | 2.12% |
| 5 | 50 | 2 | 820.39 | 13.94 | 0.0543 | 41.96% | -46.74% | 4.42% |
| 5 | 50 | 4 | 889.92 | 11.73 | 0.0544 | 53.99% | -55.20% | 4.62% |
| 10 | 50 | 0 | 772.65 | 42.06 | 0.0671 | 0.00% | 0.00% | 0.00% |
| 10 | 50 | 1 | 956.47 | 27.94 | 0.0778 | 23.79% | -33.58% | 15.95% |
| 10 | 50 | 2 | 1037.45 | 25.33 | 0.0727 | 34.27% | -39.78% | 8.35% |
| 10 | 50 | 4 | 1105.28 | 22.66 | 0.0723 | 43.05% | -46.12% | 7.75% |
| 5 | 70 | 0 | 549.76 | 28.63 | 0.0508 | 0.00% | 0.00% | 0.00% |
| 5 | 70 | 1 | 1029.84 | 9.13 | 0.0511 | 87.33% | -68.10% | 0.59% |
| 5 | 70 | 2 | 1139.46 | 6.51 | 0.0530 | 107.26% | -77.27% | 4.33% |
| 5 | 70 | 4 | 1108.51 | 7.44 | 0.0514 | 101.64% | -73.99% | 1.18% |
| 10 | 70 | 0 | 761.69 | 42.41 | 0.0681 | 0.00% | 0.00% | 0.00% |
| 10 | 70 | 1 | 1401.02 | 14.25 | 0.0717 | 83.93% | -66.39% | 5.29% |
| 10 | 70 | 2 | 1537.34 | 13.44 | 0.0633 | 101.83% | -68.30% | -7.05% |
| 10 | 70 | 4 | 1502.74 | 13.66 | 0.0656 | 97.29% | -67.78% | -3.67% |

Checklist

  • I have performed a self-review of my own code.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

@gemini-code-assist
Contributor

Summary of Changes

Hello @Shichang-Zhang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request implements a local hot cache for the Mooncake Store client, designed to optimize data retrieval in distributed environments. By caching frequently accessed data locally, the system minimizes the overhead of repeated network transfers, leading to significant performance gains. The feature is configurable via an environment variable, allowing users to tailor cache size to their specific deployment needs, and includes robust handling for various configuration inputs.

Highlights

  • New Feature: Local Hot Cache: Introduces a local hot cache mechanism for the Mooncake Store client to improve performance by reducing cross-node data transfers.
  • Configuration: The hot cache is enabled and configured by setting the LOCAL_HOT_CACHE_SIZE environment variable to a value greater than 16MB (in bytes). Memory is allocated in 16MB units.
  • Usage Scenarios: Recommended for architectures with multiple clients and a master, where it can significantly reduce latency. It is disabled for standalone client scenarios.
  • Performance Improvement: Performance tests demonstrate substantial latency reductions, with 'get' operations showing up to 76.14% improvement and 'batchGet' operations up to 71.51% after initial data retrieval.
  • Asynchronous Caching: Slices transferred from remote nodes are asynchronously added to the local hot cache, preventing blocking of read operations.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a significant performance enhancement by adding a client-side local hot cache. The cache is designed to store frequently accessed data slices, reducing cross-node data transfers. The implementation includes an LRU cache (LocalHotCache), an asynchronous worker pool (LocalHotCacheHandler) to populate the cache without blocking the main Get/BatchGet path, and configuration via the LOCAL_HOT_CACHE_SIZE environment variable. The changes are well-structured and include a comprehensive set of unit and integration tests.

My review has identified a few critical issues, including merge conflicts and compilation errors, that must be addressed. I've also pointed out a potential regression in the Python bindings and offered suggestions to improve performance in hot paths and simplify the code. Overall, this is a great feature addition.

Comment on lines 1959 to 1963
// Check for negative values
if (!ev_size_str.empty() && ev_size_str[0] == '-') {
LOG(WARNING) << "Invalid LOCAL_HOT_CACHE_SIZE='" << ev_size_str << "', disable local hot cache";
return ErrorCode::INVALID_PARAMS;
}

medium

This check for a negative number ev_size_str[0] == '-' is redundant. std::stoull will throw a std::invalid_argument exception if the string starts with a '-', which is already caught by the try-catch block. You can remove this if block to simplify the code.
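One caveat worth checking before applying this simplification: std::stoull accepts a leading '-' and negates the value in the unsigned domain rather than throwing, so the explicit negative check may not be redundant. A minimal demonstration (not from the PR):

```cpp
#include <cassert>
#include <limits>
#include <string>

// Returns true if std::stoull throws on the given string. Note that a
// negative-looking string like "-1" does NOT throw: per strtoull
// semantics, the magnitude is parsed and then negated modulo 2^64,
// yielding ULLONG_MAX instead of an error.
bool StoullThrows(const std::string& s) {
    try {
        (void)std::stoull(s);
        return false;
    } catch (const std::exception&) {
        return true;  // invalid_argument or out_of_range
    }
}
```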

@stmatengss stmatengss self-assigned this Dec 23, 2025
@stmatengss stmatengss added the enhancement label Dec 23, 2025
@stmatengss stmatengss assigned YiXR and unassigned stmatengss Dec 24, 2025
@Shichang-Zhang Shichang-Zhang requested a review from YiXR as a code owner January 5, 2026 07:32
@codecov-commenter

codecov-commenter commented Jan 5, 2026

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 84.38287% with 124 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
mooncake-store/src/local_hot_cache.cpp 77.29% 42 Missing ⚠️
mooncake-store/src/client_service.cpp 72.05% 38 Missing ⚠️
...oncake-store/tests/client_local_hot_cache_test.cpp 93.73% 28 Missing ⚠️
mooncake-store/src/transfer_task.cpp 0.00% 16 Missing ⚠️


@stmatengss
Collaborator

chaos_rand_test failed, halting the build process.

@zhuxinjie-nz Some of this code is related to you; please review it if possible.

Copilot AI left a comment

Pull request overview

This PR introduces a local hot cache mechanism for the Mooncake Store Client to improve performance by caching frequently accessed data locally, reducing cross-node data transfers. The cache is configurable via environment variables LOCAL_HOT_CACHE_SIZE (total cache size in bytes) and LOCAL_HOT_BLOCK_SIZE (block size in bytes, default 16MB). The feature is designed for multi-client architectures with a master node, where remote data fetches can benefit from local caching.

Key Changes:

  • Implements an LRU-based local hot cache with configurable memory allocation
  • Integrates cache checking and updating into Get and BatchGet operations
  • Adds asynchronous cache population using a worker thread pool
  • Improves local transfer detection by comparing IP addresses instead of full endpoints
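The last point can be illustrated with a small sketch. `ExtractIp` and `IsLocalTransfer` are hypothetical names; the real helper in transfer_task.cpp may parse endpoints differently:

```cpp
#include <cassert>
#include <string>

// Strip the port from an "ip:port" endpoint so two endpoints on the same
// machine compare equal even when their ports differ. Hypothetical sketch.
std::string ExtractIp(const std::string& endpoint) {
    auto pos = endpoint.rfind(':');
    if (pos == std::string::npos) return endpoint;  // no port present
    return endpoint.substr(0, pos);
}

// Compare by IP address instead of full endpoint to detect local transfers.
bool IsLocalTransfer(const std::string& endpoint, const std::string& local) {
    return ExtractIp(endpoint) == ExtractIp(local);
}
```

Comparing full endpoints would misclassify a transfer between two processes on the same host (same IP, different ports) as remote; comparing only the host part avoids that.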

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
mooncake-store/include/local_hot_cache.h Defines LocalHotCache and LocalHotCacheHandler classes with LRU cache management
mooncake-store/src/local_hot_cache.cpp Implements LRU cache operations, memory management, and async task processing
mooncake-store/include/client_service.h Adds cache-related public and private methods to Client class
mooncake-store/src/client_service.cpp Integrates cache into Get/BatchGet workflows, adds environment variable parsing
mooncake-store/src/transfer_task.cpp Adds IP address extraction helper for improved local transfer detection
mooncake-store/tests/client_local_hot_cache_test.cpp Comprehensive test suite for cache functionality and client integration
mooncake-store/tests/CMakeLists.txt Adds new test file to build configuration
mooncake-store/src/CMakeLists.txt Adds local_hot_cache.cpp to build sources
mooncake-integration/store/store_py.cpp Removes trailing whitespace (unrelated formatting fix)


struct HotMemBlock {
void* addr; // Memory address
size_t size; // Block size in bytes
bool in_use; // Whether the block is currently in use
Copilot AI Jan 6, 2026
The HotMemBlock struct has an in_use field that is set in the code but never actually read or checked anywhere in the implementation. This field appears to be unused and should either be utilized in the logic or removed to reduce confusion and memory overhead.

Suggested change
bool in_use; // Whether the block is currently in use
[[maybe_unused]] bool in_use; // Whether the block is currently in use

Comment on lines 894 to 899
cache_hits++;
mem_desc.buffer_descriptor.transport_endpoint_ = local_hostname_;
mem_desc.buffer_descriptor.buffer_address_ =
reinterpret_cast<uintptr_t>(blk->addr);
if (mem_desc.buffer_descriptor.size_ != blk->size) {
LOG(WARNING) << "Cache hit but size mismatch for key: " << key;
Copilot AI Jan 6, 2026
When a cache hit occurs but the size doesn't match, only a warning is logged but the cached data is still used. This could lead to data corruption if the cached block size is smaller than expected, as the transfer operation may read beyond the cached block's memory. Consider returning 0 (cache miss) when sizes don't match to force a proper remote fetch, or validate that blk->size is at least as large as the expected size before using the cached data.

Suggested change
cache_hits++;
mem_desc.buffer_descriptor.transport_endpoint_ = local_hostname_;
mem_desc.buffer_descriptor.buffer_address_ =
reinterpret_cast<uintptr_t>(blk->addr);
if (mem_desc.buffer_descriptor.size_ != blk->size) {
LOG(WARNING) << "Cache hit but size mismatch for key: " << key;
// Validate that the cached block is large enough before using it.
if (blk->size >= mem_desc.buffer_descriptor.size_) {
mem_desc.buffer_descriptor.transport_endpoint_ = local_hostname_;
mem_desc.buffer_descriptor.buffer_address_ =
reinterpret_cast<uintptr_t>(blk->addr);
cache_hits++;
if (blk->size != mem_desc.buffer_descriptor.size_) {
LOG(WARNING) << "Cache hit with larger-than-expected size for key: "
<< key << " (expected=" << mem_desc.buffer_descriptor.size_
<< ", cached=" << blk->size << ")";
}
} else {
// Cached block is smaller than expected; treat as cache miss to avoid
// potential out-of-bounds access when transferring data.
LOG(WARNING) << "Cache hit but cached block is smaller than expected for key: "
<< key << " (expected=" << mem_desc.buffer_descriptor.size_
<< ", cached=" << blk->size << ")";

Comment on lines +457 to +460
err = client->InitLocalHotCache();
if (err != ErrorCode::OK) {
LOG(ERROR) << "Failed to initialize local hot cache";
}
Collaborator
On the failure path, we should return std::nullopt to the user to signal that the current environment doesn't meet the configuration requirements.

Contributor Author
This failure path is a fallback path; the error log tells the user the configuration did not take effect. Should a wrong performance configuration block the initialization of Mooncake?

Collaborator
Yes. If the Store cannot run while satisfying the user's configuration, initialization should fail directly. For users, a cluster startup failure is the most direct feedback; they then have to fix their startup configuration or check their environment.

Comment on lines 269 to 272
Slice slice;
slice.ptr = task.data.data();
slice.size = task.size;
task.hot_cache->PutHotSlice(task.key, slice);
Collaborator
  1. The handler already holds a hot_cache_ pointer, so the task doesn't need to store another one.
  2. If PutHotSlice() returns false, we should print an error log.

@XucSh XucSh mentioned this pull request Feb 10, 2026
@Shichang-Zhang Shichang-Zhang force-pushed the upstream-zsc-working branch 4 times, most recently from 7f9b6df to 68c54f4 Compare February 10, 2026 15:20
@Shichang-Zhang Shichang-Zhang force-pushed the upstream-zsc-working branch 2 times, most recently from bfbc7cc to a081245 Compare February 10, 2026 17:44
@stmatengss stmatengss merged commit f21b379 into kvcache-ai:main Feb 11, 2026
16 checks passed
Comment on lines +990 to +993
if (mem_desc.buffer_descriptor.size_ != blk->size) {
LOG(ERROR) << "Cache hit but size mismatch for key: " << key;
return false;
}
Collaborator
Print both sizes in the log.

Comment on lines +2541 to +2544
// Only cache slices that came from TE transfer (non-local).
if (mem_desc.buffer_descriptor.transport_endpoint_ == local_hostname_) {
return;
}
Collaborator
Wrap this condition in a function that decides whether SubmitPutTask is needed; each caller of ProcessSlicesAsync() should check that condition function.

Comment on lines +2546 to +2552
// Identify TE transfer slices (non-local) and submit async put tasks
for (size_t i = 0; i < slices.size(); ++i) {
if (!hot_cache_handler_->SubmitPutTask(key, slices[i])) {
LOG(ERROR) << "Failed to submit hot cache put task for key=" << key
<< " slice_idx=" << i;
return;
}
Collaborator
By default, each key currently has only one slice, so we should not use a loop here. It would be better to add a defensive check verifying that the slice count is one.

Comment on lines +2549 to +2550
LOG(ERROR) << "Failed to submit hot cache put task for key=" << key
<< " slice_idx=" << i;
Collaborator
A warning log is enough here.

}
}

bool LocalHotCache::PutHotKey(HotMemBlock* block) {
Collaborator
This function handles two responsibilities: releasing free blocks and enqueuing allocated blocks (enqueuing at the end and the beginning of lru_queue, respectively). For easier maintenance, it would be better to split lru_queue into two parts: a free_block_list that only allocates and releases memory, and an lru_queue similar to the current one.
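A rough sketch of the suggested split, with a free list feeding a key-indexed LRU; all types and names here are illustrative, not the PR's code:

```cpp
#include <cassert>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

struct Block { int id; };  // stand-in for HotMemBlock

// Free blocks live in free_list_; lru_ holds only blocks that currently
// cache a key (front = most recently used). Eviction touches only lru_.
class SplitLru {
   public:
    explicit SplitLru(int nblocks) {
        for (int i = 0; i < nblocks; ++i) free_list_.push_back(Block{i});
    }
    bool Put(const std::string& key) {
        if (index_.count(key)) { Touch(key); return true; }
        Block b;
        if (!free_list_.empty()) {            // prefer a free block
            b = free_list_.front();
            free_list_.pop_front();
        } else if (!lru_.empty()) {           // otherwise evict the coldest
            index_.erase(lru_.back().first);
            b = lru_.back().second;
            lru_.pop_back();
        } else {
            return false;                     // no capacity at all
        }
        lru_.emplace_front(key, b);
        index_[key] = lru_.begin();
        return true;
    }
    bool Contains(const std::string& key) const {
        return index_.count(key) > 0;
    }
    void Touch(const std::string& key) {      // move to hottest position
        auto it = index_.find(key);
        if (it != index_.end()) lru_.splice(lru_.begin(), lru_, it->second);
    }

   private:
    std::list<Block> free_list_;
    std::list<std::pair<std::string, Block>> lru_;
    std::unordered_map<std::string,
                       std::list<std::pair<std::string, Block>>::iterator>
        index_;
};
```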

* @param block The block containing the data and key.
* @return true if inserted successfully, false if race condition or error.
*/
bool PutHotKey(HotMemBlock* block);
Collaborator
Each function should use tl::expected<T, ErrorCode> as its return value.

return key_to_lru_it_.find(key) != key_to_lru_it_.end();
}

HotMemBlock* LocalHotCache::GetHotKey(const std::string& key) {
Collaborator
It would be better to wrap this in a HotCacheGuard that stores the HotMemBlock. GetHotKey() should then return the guard, which automatically calls ReleaseHotKey when it is destructed.
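A minimal sketch of such a guard. The `CacheIface` interface and its signatures are assumptions made for illustration; the real HotMemBlock and ReleaseHotKey may differ:

```cpp
#include <cassert>
#include <string>

struct HotMemBlock { /* cached data lives here */ };

// Minimal cache interface the guard needs; illustrative only.
class CacheIface {
   public:
    virtual ~CacheIface() = default;
    virtual HotMemBlock* GetHotKey(const std::string& key) = 0;
    virtual void ReleaseHotKey(const std::string& key) = 0;
};

// RAII wrapper: pins the block for the scope of a read and releases the
// key automatically, so callers cannot forget ReleaseHotKey().
class HotCacheGuard {
   public:
    HotCacheGuard(CacheIface& cache, const std::string& key)
        : cache_(cache), key_(key), block_(cache.GetHotKey(key)) {}
    ~HotCacheGuard() {
        if (block_ != nullptr) cache_.ReleaseHotKey(key_);
    }
    HotCacheGuard(const HotCacheGuard&) = delete;
    HotCacheGuard& operator=(const HotCacheGuard&) = delete;
    HotMemBlock* get() const { return block_; }
    explicit operator bool() const { return block_ != nullptr; }

   private:
    CacheIface& cache_;
    std::string key_;
    HotMemBlock* block_;
};
```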

Comment on lines +172 to +174
if (victim_it == lru_queue_.end()) {
return nullptr;
}
Collaborator
Print a warning log explaining that the queue is full.

Comment on lines +255 to +258
// Optimization: if key exists, just touch LRU to avoid data copy
if (hot_cache_->TouchHotKey(key)) {
return true;
}
Collaborator
Just checking the existence of the key is enough; there is no need to touch the LRU list.

Comment on lines +337 to +339
if (task.hot_cache->PutHotKey(task.block)) {
VLOG(2) << "Put task completed: " << task.key;
} else {
Collaborator
There is no need to store the hot_cache pointer in the task; the LocalHotCacheHandler already has it.

@wanyue-wy
Collaborator

A follow-up PR building on this basic one should resolve the following tasks:

  1. Optimize the LRU mechanism: currently every cache read request is blocked by a global LRU lock. The lock granularity is too coarse; we should consider a more effective concurrent algorithm.
  2. Avoid synchronous memcpy when a cache task is enqueued: the hot cache capacity is far smaller than the cluster's storage, so cache misses are very common, and every miss triggers enqueuing a cache task. The synchronous memcpy at that point seriously hurts overall performance.
  3. The hot cache should be compatible with the P2P Mooncake Store structure.

qiuweit7 pushed a commit to openanolis/Mooncake that referenced this pull request Feb 13, 2026
…ai#1226)

* feat(Store): add local hot cache for client

* feat(Store): add client local hot cache log to show performance

* fix: local hot cache initialize bug

* fix(Store): Mooncake put slice is max 16MB, so make local hot cache block 16MB

* feat(Store): move local hot cache initialization to Client::Create

* feat(Store):  local hot cache remove unused small block implementation

* feat(Store): add client local hot cache unit test

* fix(Store): modify client local hot cache suit with v0.3.7

* feat(Store): change local hot cache unit tes

* fix: initialize local hot cache with negative value

* feat: use in process master and metadata fro local hot cache unit test.

* feat: update local hot cache to one replica one slice version

* fix: local hot cache unit test use in process master service

* fix: code style fix

* fix: fix dirty read when client wants to read a previously hitted hot block but the hot block is modified by incoming put actions

* fix: local hot cache unit test use in process master service

* fix: code format fix

* fix: fix comment problems for

* feat: add local hot asynchronous queue size limit

* fix: local hot cache task involves the block so that there is no memcpy operation when inserting local hot cache

* fix: code check fix

* fix: update block in_use prop to reference count

---------

Co-authored-by: shichangzhang064 <zhangshichang@h-partners.com>

Labels

enhancement New feature or request run-ci Store
