Add Local Cache Mechanism for Mooncake Store Client #1226

stmatengss merged 22 commits into kvcache-ai:main from
Conversation
Summary of Changes (Gemini Code Assist): This pull request implements a local hot cache for the Mooncake Store client, designed to optimize data retrieval in distributed environments. By caching frequently accessed data locally, the system minimizes the overhead of repeated network transfers, leading to significant performance gains. The feature is configurable via an environment variable, allowing users to tailor the cache size to their specific deployment needs, and includes robust handling for various configuration inputs.
Code Review
This pull request introduces a significant performance enhancement by adding a client-side local hot cache. The cache is designed to store frequently accessed data slices, reducing cross-node data transfers. The implementation includes an LRU cache (LocalHotCache), an asynchronous worker pool (LocalHotCacheHandler) to populate the cache without blocking the main Get/BatchGet path, and configuration via the LOCAL_HOT_CACHE_SIZE environment variable. The changes are well-structured and include a comprehensive set of unit and integration tests.
My review has identified a few critical issues, including merge conflicts and compilation errors, that must be addressed. I've also pointed out a potential regression in the Python bindings and offered suggestions to improve performance in hot paths and simplify the code. Overall, this is a great feature addition.
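The asynchronous population path summarized above — a worker that fills the cache off the Get/BatchGet critical path — can be sketched roughly as follows. This is an illustrative sketch only: `PutTask`, `AsyncCachePopulator`, the bounded queue, and the drop-on-full policy are assumptions, not the PR's actual `LocalHotCacheHandler` API.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// Illustrative task carrying the key and data to be cached.
struct PutTask {
    std::string key;
    std::vector<char> data;
};

class AsyncCachePopulator {
public:
    explicit AsyncCachePopulator(std::function<void(const PutTask&)> put_fn,
                                 size_t max_queue = 1024)
        : put_fn_(std::move(put_fn)), max_queue_(max_queue),
          worker_([this] { Run(); }) {}

    ~AsyncCachePopulator() {
        {
            std::lock_guard<std::mutex> lk(mu_);
            stop_ = true;
        }
        cv_.notify_all();
        worker_.join();  // drains remaining tasks before exiting
    }

    // Non-blocking submit: drop the task (return false) when the queue is
    // full, so the caller's Get path never stalls on cache population.
    bool Submit(PutTask task) {
        {
            std::lock_guard<std::mutex> lk(mu_);
            if (queue_.size() >= max_queue_) return false;
            queue_.push(std::move(task));
        }
        cv_.notify_one();
        return true;
    }

private:
    void Run() {
        std::unique_lock<std::mutex> lk(mu_);
        while (true) {
            cv_.wait(lk, [this] { return stop_ || !queue_.empty(); });
            if (queue_.empty()) {
                if (stop_) return;
                continue;
            }
            PutTask task = std::move(queue_.front());
            queue_.pop();
            lk.unlock();
            put_fn_(task);  // insert into the cache outside the lock
            lk.lock();
        }
    }

    std::function<void(const PutTask&)> put_fn_;
    size_t max_queue_;
    std::mutex mu_;
    std::condition_variable cv_;
    std::queue<PutTask> queue_;
    bool stop_ = false;
    std::thread worker_;  // declared last so all state above is ready first
};
```

The destructor processes any queued tasks before joining, so short-lived clients still populate the cache for submitted work.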
```cpp
// Check for negative values
if (!ev_size_str.empty() && ev_size_str[0] == '-') {
    LOG(WARNING) << "Invalid LOCAL_HOT_CACHE_SIZE='" << ev_size_str
                 << "', disable local hot cache";
    return ErrorCode::INVALID_PARAMS;
}
```
Force-pushed from fe332db to 828c487.
Codecov Report ❌ Patch coverage is
Force-pushed from a98d252 to 44b6b7d.
@zhuxinjie-nz Some of this code is related to you. Please review it if possible.
Pull request overview
This PR introduces a local hot cache mechanism for the Mooncake Store Client to improve performance by caching frequently accessed data locally, reducing cross-node data transfers. The cache is configurable via environment variables LOCAL_HOT_CACHE_SIZE (total cache size in bytes) and LOCAL_HOT_BLOCK_SIZE (block size in bytes, default 16MB). The feature is designed for multi-client architectures with a master node, where remote data fetches can benefit from local caching.
Key Changes:
- Implements an LRU-based local hot cache with configurable memory allocation
- Integrates cache checking and updating into Get and BatchGet operations
- Adds asynchronous cache population using a worker thread pool
- Improves local transfer detection by comparing IP addresses instead of full endpoints
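The improved local transfer detection (comparing IP addresses instead of full endpoints) can be sketched as below. `ExtractIp`, `IsLocalTransfer`, and the plain `host:port` endpoint format are assumptions for illustration, not the PR's exact helper in transfer_task.cpp; bracketed IPv6 literals would need extra handling.

```cpp
#include <cassert>
#include <string>

// Strip the port from an "ip:port" endpoint; an endpoint without a port
// is returned unchanged. (IPv6 "[addr]:port" forms are not handled here.)
std::string ExtractIp(const std::string& endpoint) {
    size_t pos = endpoint.rfind(':');
    return pos == std::string::npos ? endpoint : endpoint.substr(0, pos);
}

// Two endpoints are "local" to each other if their IPs match, even when
// the ports differ (e.g. separate processes on the same node).
bool IsLocalTransfer(const std::string& endpoint, const std::string& local) {
    return ExtractIp(endpoint) == ExtractIp(local);
}
```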
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| mooncake-store/include/local_hot_cache.h | Defines LocalHotCache and LocalHotCacheHandler classes with LRU cache management |
| mooncake-store/src/local_hot_cache.cpp | Implements LRU cache operations, memory management, and async task processing |
| mooncake-store/include/client_service.h | Adds cache-related public and private methods to Client class |
| mooncake-store/src/client_service.cpp | Integrates cache into Get/BatchGet workflows, adds environment variable parsing |
| mooncake-store/src/transfer_task.cpp | Adds IP address extraction helper for improved local transfer detection |
| mooncake-store/tests/client_local_hot_cache_test.cpp | Comprehensive test suite for cache functionality and client integration |
| mooncake-store/tests/CMakeLists.txt | Adds new test file to build configuration |
| mooncake-store/src/CMakeLists.txt | Adds local_hot_cache.cpp to build sources |
| mooncake-integration/store/store_py.cpp | Removes trailing whitespace (unrelated formatting fix) |
```cpp
struct HotMemBlock {
    void* addr;    // Memory address
    size_t size;   // Block size in bytes
    bool in_use;   // Whether the block is currently in use
};
```
The HotMemBlock struct has an in_use field that is set in the code but never actually read or checked anywhere in the implementation. This field appears to be unused and should either be utilized in the logic or removed to reduce confusion and memory overhead.
Suggested change:

```diff
-    bool in_use;  // Whether the block is currently in use
+    [[maybe_unused]] bool in_use;  // Whether the block is currently in use
```
```cpp
cache_hits++;
mem_desc.buffer_descriptor.transport_endpoint_ = local_hostname_;
mem_desc.buffer_descriptor.buffer_address_ =
    reinterpret_cast<uintptr_t>(blk->addr);
if (mem_desc.buffer_descriptor.size_ != blk->size) {
    LOG(WARNING) << "Cache hit but size mismatch for key: " << key;
```
When a cache hit occurs but the size doesn't match, only a warning is logged but the cached data is still used. This could lead to data corruption if the cached block size is smaller than expected, as the transfer operation may read beyond the cached block's memory. Consider returning 0 (cache miss) when sizes don't match to force a proper remote fetch, or validate that blk->size is at least as large as the expected size before using the cached data.
Suggested change:

```cpp
// Validate that the cached block is large enough before using it.
if (blk->size >= mem_desc.buffer_descriptor.size_) {
    mem_desc.buffer_descriptor.transport_endpoint_ = local_hostname_;
    mem_desc.buffer_descriptor.buffer_address_ =
        reinterpret_cast<uintptr_t>(blk->addr);
    cache_hits++;
    if (blk->size != mem_desc.buffer_descriptor.size_) {
        LOG(WARNING) << "Cache hit with larger-than-expected size for key: "
                     << key << " (expected=" << mem_desc.buffer_descriptor.size_
                     << ", cached=" << blk->size << ")";
    }
} else {
    // Cached block is smaller than expected; treat as cache miss to avoid
    // potential out-of-bounds access when transferring data.
    LOG(WARNING) << "Cache hit but cached block is smaller than expected for key: "
                 << key << " (expected=" << mem_desc.buffer_descriptor.size_
                 << ", cached=" << blk->size << ")";
}
```
Force-pushed from 65d1a8b to 22a8688.
```cpp
err = client->InitLocalHotCache();
if (err != ErrorCode::OK) {
    LOG(ERROR) << "Failed to initialize local hot cache";
}
```
On the failure path, we should return std::nullopt to the user to signal that the current environment does not meet the configuration requirements.

This failure path is a fallback path; an error log tells the user that the configuration does not take effect. Should a wrong performance configuration block the initialization of Mooncake?

Yes. I think that if the Store cannot run while meeting the user's configuration, then initialization should fail directly. For users, a failed cluster startup is the most direct feedback; they will then modify their startup configuration or check their environment.
```cpp
Slice slice;
slice.ptr = task.data.data();
slice.size = task.size;
task.hot_cache->PutHotSlice(task.key, slice);
```
- The handler already holds a hot_cache_ pointer, so the task doesn't need to store another hot_cache pointer.
- If PutHotSlice() returns false, we should print an error log.
Force-pushed from 7f9b6df to 68c54f4.
Force-pushed from 68c54f4 to 025c4c2.
Force-pushed from bfbc7cc to a081245.
Force-pushed from a081245 to fd932da.
```cpp
if (mem_desc.buffer_descriptor.size_ != blk->size) {
    LOG(ERROR) << "Cache hit but size mismatch for key: " << key;
    return false;
}
```
Print both sizes in the log.
```cpp
// Only cache slices that came from TE transfer (non-local).
if (mem_desc.buffer_descriptor.transport_endpoint_ == local_hostname_) {
    return;
}
```
Wrap this condition in a function that decides whether SubmitPutTask is needed; each caller of ProcessSlicesAsync() should check this condition function.
```cpp
// Identify TE transfer slices (non-local) and submit async put tasks
for (size_t i = 0; i < slices.size(); ++i) {
    if (!hot_cache_handler_->SubmitPutTask(key, slices[i])) {
        LOG(ERROR) << "Failed to submit hot cache put task for key=" << key
                   << " slice_idx=" << i;
        return;
    }
```
By default, each key currently has only one slice, so we should not use a loop here. It would be better to add a defensive check verifying that the slice count is one.
```cpp
LOG(ERROR) << "Failed to submit hot cache put task for key=" << key
           << " slice_idx=" << i;
```
Just using a warning log is enough.
```cpp
    }
}

bool LocalHotCache::PutHotKey(HotMemBlock* block) {
```
This function handles two pieces of logic: releasing free blocks and enqueuing allocated blocks (enqueued at the end and at the beginning of lru_queue, respectively). For easier code maintenance, I think it is better to split lru_queue into two parts: a free_block_list that only allocates and releases memory, and an lru_queue similar to the current one.
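A minimal sketch of the suggested split, using toy names (`MemBlock`, `SimpleHotCache`) rather than the PR's actual classes: unpopulated blocks live in a free list, populated blocks in the LRU queue, and eviction moves a node back between the two lists via `splice`, with no allocation or data copy.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <string>
#include <unordered_map>

struct MemBlock {
    std::string key;
    size_t size = 0;
};

class SimpleHotCache {
public:
    explicit SimpleHotCache(size_t num_blocks) {
        for (size_t i = 0; i < num_blocks; ++i) free_list_.emplace_back();
    }

    // Insert a key: take a block from the free list, or evict the LRU victim.
    bool Put(const std::string& key, size_t size) {
        if (index_.count(key)) return true;  // already cached
        if (free_list_.empty()) {
            if (lru_.empty()) return false;  // no capacity at all
            // Evict the least-recently-used block (back of the queue).
            index_.erase(lru_.back().key);
            free_list_.splice(free_list_.end(), lru_, std::prev(lru_.end()));
        }
        // Move a free block to the front (most recently used) of the queue.
        lru_.splice(lru_.begin(), free_list_, free_list_.begin());
        lru_.front() = MemBlock{key, size};
        index_[key] = lru_.begin();
        return true;
    }

    // Look up a key and promote it to most-recently-used.
    const MemBlock* Get(const std::string& key) {
        auto it = index_.find(key);
        if (it == index_.end()) return nullptr;
        lru_.splice(lru_.begin(), lru_, it->second);  // touch
        return &*it->second;
    }

private:
    std::list<MemBlock> lru_;        // populated blocks, MRU at front
    std::list<MemBlock> free_list_;  // unpopulated blocks
    std::unordered_map<std::string, std::list<MemBlock>::iterator> index_;
};
```

`std::list::splice` preserves iterators into the moved node, so the `index_` map stays valid across touch and eviction without rehashing or reallocation.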
```cpp
 * @param block The block containing the data and key.
 * @return true if inserted successfully, false if race condition or error.
 */
bool PutHotKey(HotMemBlock* block);
```
Each function should use tl::expected<T, ErrorCode> as its return value.
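A sketch of the suggested shape. Since tl::expected is a third-party header, this example uses a minimal stand-in `Expected` type; real code would include `<tl/expected.hpp>` (or use `std::expected` in C++23) directly. The `ToyCache` class and its error codes are illustrative, not the PR's API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <variant>

enum class ErrorCode { INVALID_PARAMS, KEY_ALREADY_EXISTS, NO_FREE_BLOCK };

// Minimal stand-in for tl::expected<T, ErrorCode>.
template <typename T>
class Expected {
public:
    Expected(T value) : data_(std::move(value)) {}
    Expected(ErrorCode err) : data_(err) {}
    bool has_value() const { return std::holds_alternative<T>(data_); }
    const T& value() const { return std::get<T>(data_); }
    ErrorCode error() const { return std::get<ErrorCode>(data_); }
private:
    std::variant<T, ErrorCode> data_;
};

// Hypothetical reshaped PutHotKey: the caller learns *why* an insert failed
// instead of just seeing `false`. The cache internals here are a toy map.
class ToyCache {
public:
    Expected<bool> PutHotKey(const std::string& key, size_t size) {
        if (key.empty()) return ErrorCode::INVALID_PARAMS;
        if (entries_.count(key)) return ErrorCode::KEY_ALREADY_EXISTS;
        entries_[key] = size;
        return true;
    }
private:
    std::unordered_map<std::string, size_t> entries_;
};
```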
```cpp
    return key_to_lru_it_.find(key) != key_to_lru_it_.end();
}

HotMemBlock* LocalHotCache::GetHotKey(const std::string& key) {
```
It would be better to wrap the HotMemBlock in a HotCacheGuard. GetHotKey() should then return the guard; when the guard is destructed, it automatically calls ReleaseHotKey.
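A hedged sketch of this RAII idea: the guard pins a block on acquisition and releases it automatically in its destructor, so callers cannot forget ReleaseHotKey. `ToyCache` and its reference-count map are illustrative stand-ins for the PR's LocalHotCache.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

struct HotMemBlock {
    void* addr = nullptr;
    size_t size = 0;
};

class ToyCache {
public:
    HotMemBlock* GetHotKey(const std::string& key) {
        auto it = blocks_.find(key);
        if (it == blocks_.end()) return nullptr;
        ++refcount_[key];  // pin: block must not be evicted while in use
        return &it->second;
    }
    void ReleaseHotKey(const std::string& key) { --refcount_[key]; }
    int RefCount(const std::string& key) { return refcount_[key]; }
    void Insert(const std::string& key, HotMemBlock blk) { blocks_[key] = blk; }
private:
    std::unordered_map<std::string, HotMemBlock> blocks_;
    std::unordered_map<std::string, int> refcount_;
};

// RAII guard: acquires the block in the constructor, releases it on scope
// exit. Non-copyable so a pin cannot be double-released.
class HotCacheGuard {
public:
    HotCacheGuard(ToyCache& cache, const std::string& key)
        : cache_(cache), key_(key), block_(cache.GetHotKey(key)) {}
    ~HotCacheGuard() {
        if (block_) cache_.ReleaseHotKey(key_);
    }
    HotCacheGuard(const HotCacheGuard&) = delete;
    HotCacheGuard& operator=(const HotCacheGuard&) = delete;
    HotMemBlock* get() const { return block_; }
    explicit operator bool() const { return block_ != nullptr; }
private:
    ToyCache& cache_;
    std::string key_;
    HotMemBlock* block_;
};
```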
```cpp
if (victim_it == lru_queue_.end()) {
    return nullptr;
}
```
Print a warning log to explain that the queue is full.
```cpp
// Optimization: if key exists, just touch LRU to avoid data copy
if (hot_cache_->TouchHotKey(key)) {
    return true;
}
```
Just checking the existence of the key is enough; there is no need to touch the LRU list.
```cpp
if (task.hot_cache->PutHotKey(task.block)) {
    VLOG(2) << "Put task completed: " << task.key;
} else {
```
There is no need to store the hot_cache pointer in the task; the LocalHotCacheHandler already holds it.
A follow-up PR based on this basic PR should resolve the following tasks:
…ai#1226)

* feat(Store): add local hot cache for client
* feat(Store): add client local hot cache log to show performance
* fix: local hot cache initialize bug
* fix(Store): Mooncake put slice is max 16MB, so make local hot cache block 16MB
* feat(Store): move local hot cache initialization to Client::Create
* feat(Store): local hot cache remove unused small block implementation
* feat(Store): add client local hot cache unit test
* fix(Store): modify client local hot cache to suit v0.3.7
* feat(Store): change local hot cache unit test
* fix: initialize local hot cache with negative value
* feat: use in-process master and metadata for local hot cache unit test
* feat: update local hot cache to one-replica-one-slice version
* fix: local hot cache unit test use in-process master service
* fix: code style fix
* fix: fix dirty read when client wants to read a previously hit hot block but the hot block is modified by incoming put actions
* fix: local hot cache unit test use in-process master service
* fix: code format fix
* fix: fix comment problems for
* feat: add local hot asynchronous queue size limit
* fix: local hot cache task involves the block so that there is no memcpy operation when inserting local hot cache
* fix: code check fix
* fix: update block in_use prop to reference count

Co-authored-by: shichangzhang064 <zhangshichang@h-partners.com>
Description
This PR introduces the client hot cache feature mentioned in Issue 1062.
This feature is enabled by setting the environment variable LOCAL_HOT_CACHE_SIZE (in bytes). The Mooncake Store client will allocate memory blocks for caching hot data on a per-request basis. The block size is configurable via the environment variable LOCAL_HOT_BLOCK_SIZE (in bytes), with a default value of 16MB. Memory allocation follows the configured block size unit. If LOCAL_HOT_CACHE_SIZE is set to less than the block size (either the default 16MB or the value specified by LOCAL_HOT_BLOCK_SIZE), zero, negative, or another invalid input (such as a non-numeric string), the client hot cache feature will be disabled. For the standalone client scenario, the feature should stay disabled since all data is in local storage.
For architectures with multiple clients and a master, the feature is recommended. Users should adjust the LOCAL_HOT_CACHE_SIZE value based on their actual deployment context to strike an optimal balance, so that the performance gain from reduced cross-node data transfer outweighs the overhead of the additional local cache memory.
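The configuration rule described above can be sketched as a small parsing function. This is a sketch under stated assumptions: `ParseHotCacheSize` is an illustrative name (the PR's actual parsing lives in client_service.cpp), and a 16MB default block size is assumed per the description.

```cpp
#include <cassert>
#include <cstdlib>
#include <optional>
#include <string>

// Returns std::nullopt (feature disabled) for empty, negative, non-numeric,
// zero, or smaller-than-one-block values; otherwise the cache size in bytes.
std::optional<size_t> ParseHotCacheSize(const std::string& raw,
                                        size_t block_size = 16ull << 20) {
    if (raw.empty() || raw[0] == '-') return std::nullopt;  // empty/negative
    char* end = nullptr;
    unsigned long long v = std::strtoull(raw.c_str(), &end, 10);
    if (end == raw.c_str() || *end != '\0') return std::nullopt;  // non-numeric
    if (v < block_size) return std::nullopt;  // zero or below one block
    return static_cast<size_t>(v);
}
```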
Type of Change
How Has This Been Tested?
Run the unit tests for the feature: mooncake-store/tests/client_local_hot_cache_test.cpp.

Performance Test
Software
Hardware
Mooncake Performance Test
Environment
We run the test on a Kubernetes cluster that involves two Atlas 800I A2 machines.
Method
Start master-service pod as Mooncake Master and redis pod as the metadata server.
Start a Mooncake Client pod on node 1.
Call the MooncakeDistributedStore::Put interface to store data (e.g., 1GB) on Mooncake Client 1.

3.1 Store 10 keys with corresponding data (e.g., 1GB total) in Mooncake.
3.2 All data slices are stored on node 1 since only one client exists.
Start another Mooncake Client pod with local hot cache (e.g., 1GB) enabled on node 2.
Call the MooncakeDistributedStore::Get/BatchGet interface from Mooncake Client 2 to retrieve the stored data.

5.1 All data slices are on node 1, so all slices requested by Mooncake Client 2 will be transferred remotely.
5.2 Use TCP protocol to increase network transmission latency.
5.3 After retrieving data from remote node 1, the slices are stored in the local hot cache.
Record the latency of the Get/BatchGet calls.
Repeatedly call the MooncakeDistributedStore::Get/BatchGet interface from Mooncake Client 2 and record the latency.

7.1 These requests result in cache hits.
Result
Note: The latency is measured per key request; for the BatchGet interface, the first-round latency is the average latency of the first 2 keys.

End to End Inference Performance Test
Environment
We run the test on a Kubernetes cluster that involves three Atlas 800I A2 machines. We deployed the inference service with vLLM v0 and Mooncake.
Method
Start master-service pod as Mooncake Master and redis pod as the metadata server.
Start 5 prefill pods and 5 decode pods running vLLM and Mooncake Client. Each client offers 30GB store memory.
2.1 Deploy 10 Mooncake Clients to increase the probability of remote slice data transmission.
2.2 Use TCP protocol and deploy across three machines to increase network transmission latency.
2.3 Prefill instances do not enable the local hot cache feature, while decode instances enable it.
Start proxy server pod for routing inference requests
3.1 Use the example proxy server from vLLM.
Start EvalScope script
4.1 Send 1000 inference requests at concurrency 5 or 10, with a fraction of exact duplicate requests to simulate hotspot/cache hits. Average input length is 4096 tokens and output length is fixed at 256 tokens.
4.2 Use a random dataset to vary prefix lengths, controlling prefix sharing across requests and thus the number of hot slices (and hit rate).
4.3 EvalScope summarizes and logs key performance metrics such as throughput, latency (p50/p95/p99), and success/error rates.
Test Script
Note: This is the start command of the prefill pod, so POD_IP is the node address of this pod.

Note: This is the start command of the decode pod, so POD_IP is the node address of this pod.

Result
Checklist