Add Local Cache Mechanism for Mooncake Store Client #1226

stmatengss merged 22 commits into kvcache-ai:main from
Conversation
Summary of Changes (Gemini Code Assist): This pull request implements a local hot cache for the Mooncake Store client, designed to optimize data retrieval in distributed environments. By caching frequently accessed data locally, the system minimizes the overhead of repeated network transfers, leading to significant performance gains. The feature is configurable via an environment variable, allowing users to tailor the cache size to their specific deployment needs, and includes robust handling for various configuration inputs.
Code Review
This pull request introduces a significant performance enhancement by adding a client-side local hot cache. The cache is designed to store frequently accessed data slices, reducing cross-node data transfers. The implementation includes an LRU cache (LocalHotCache), an asynchronous worker pool (LocalHotCacheHandler) to populate the cache without blocking the main Get/BatchGet path, and configuration via the LOCAL_HOT_CACHE_SIZE environment variable. The changes are well-structured and include a comprehensive set of unit and integration tests.
My review has identified a few critical issues, including merge conflicts and compilation errors, that must be addressed. I've also pointed out a potential regression in the Python bindings and offered suggestions to improve performance in hot paths and simplify the code. Overall, this is a great feature addition.
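The asynchronous population path summarized above — a worker that fills the cache off the Get/BatchGet critical path — can be sketched roughly as follows. This is an illustrative sketch only: `PutTask`, `AsyncCachePopulator`, the bounded queue, and the drop-on-full policy are assumptions, not the PR's actual `LocalHotCacheHandler` API.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// Illustrative task carrying the key and data to be cached.
struct PutTask {
    std::string key;
    std::vector<char> data;
};

class AsyncCachePopulator {
public:
    explicit AsyncCachePopulator(std::function<void(const PutTask&)> put_fn,
                                 size_t max_queue = 1024)
        : put_fn_(std::move(put_fn)), max_queue_(max_queue),
          worker_([this] { Run(); }) {}

    ~AsyncCachePopulator() {
        {
            std::lock_guard<std::mutex> lk(mu_);
            stop_ = true;
        }
        cv_.notify_all();
        worker_.join();  // drains remaining tasks before exiting
    }

    // Non-blocking submit: drop the task (return false) when the queue is
    // full, so the caller's Get path never stalls on cache population.
    bool Submit(PutTask task) {
        {
            std::lock_guard<std::mutex> lk(mu_);
            if (queue_.size() >= max_queue_) return false;
            queue_.push(std::move(task));
        }
        cv_.notify_one();
        return true;
    }

private:
    void Run() {
        std::unique_lock<std::mutex> lk(mu_);
        while (true) {
            cv_.wait(lk, [this] { return stop_ || !queue_.empty(); });
            if (queue_.empty()) {
                if (stop_) return;
                continue;
            }
            PutTask task = std::move(queue_.front());
            queue_.pop();
            lk.unlock();
            put_fn_(task);  // insert into the cache outside the lock
            lk.lock();
        }
    }

    std::function<void(const PutTask&)> put_fn_;
    size_t max_queue_;
    std::mutex mu_;
    std::condition_variable cv_;
    std::queue<PutTask> queue_;
    bool stop_ = false;
    std::thread worker_;  // declared last so all state above is ready first
};
```

The destructor processes any queued tasks before joining, so short-lived clients still populate the cache for submitted work.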
```cpp
// Check for negative values
if (!ev_size_str.empty() && ev_size_str[0] == '-') {
    LOG(WARNING) << "Invalid LOCAL_HOT_CACHE_SIZE='" << ev_size_str
                 << "', disable local hot cache";
    return ErrorCode::INVALID_PARAMS;
}
```
Force-pushed from fe332db to 828c487.
Codecov Report ❌ Patch coverage is
Force-pushed from a98d252 to 44b6b7d.
@zhuxinjie-nz Some of this code is related to you. Please review it if possible.
Pull request overview
This PR introduces a local hot cache mechanism for the Mooncake Store Client to improve performance by caching frequently accessed data locally, reducing cross-node data transfers. The cache is configurable via environment variables LOCAL_HOT_CACHE_SIZE (total cache size in bytes) and LOCAL_HOT_BLOCK_SIZE (block size in bytes, default 16MB). The feature is designed for multi-client architectures with a master node, where remote data fetches can benefit from local caching.
Key Changes:
- Implements an LRU-based local hot cache with configurable memory allocation
- Integrates cache checking and updating into Get and BatchGet operations
- Adds asynchronous cache population using a worker thread pool
- Improves local transfer detection by comparing IP addresses instead of full endpoints
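The improved local transfer detection (comparing IP addresses instead of full endpoints) can be sketched as below. `ExtractIp`, `IsLocalTransfer`, and the plain `host:port` endpoint format are assumptions for illustration, not the PR's exact helper in transfer_task.cpp; bracketed IPv6 literals would need extra handling.

```cpp
#include <cassert>
#include <string>

// Strip the port from an "ip:port" endpoint; an endpoint without a port
// is returned unchanged. (IPv6 "[addr]:port" forms are not handled here.)
std::string ExtractIp(const std::string& endpoint) {
    size_t pos = endpoint.rfind(':');
    return pos == std::string::npos ? endpoint : endpoint.substr(0, pos);
}

// Two endpoints are "local" to each other if their IPs match, even when
// the ports differ (e.g. separate processes on the same node).
bool IsLocalTransfer(const std::string& endpoint, const std::string& local) {
    return ExtractIp(endpoint) == ExtractIp(local);
}
```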
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| mooncake-store/include/local_hot_cache.h | Defines LocalHotCache and LocalHotCacheHandler classes with LRU cache management |
| mooncake-store/src/local_hot_cache.cpp | Implements LRU cache operations, memory management, and async task processing |
| mooncake-store/include/client_service.h | Adds cache-related public and private methods to Client class |
| mooncake-store/src/client_service.cpp | Integrates cache into Get/BatchGet workflows, adds environment variable parsing |
| mooncake-store/src/transfer_task.cpp | Adds IP address extraction helper for improved local transfer detection |
| mooncake-store/tests/client_local_hot_cache_test.cpp | Comprehensive test suite for cache functionality and client integration |
| mooncake-store/tests/CMakeLists.txt | Adds new test file to build configuration |
| mooncake-store/src/CMakeLists.txt | Adds local_hot_cache.cpp to build sources |
| mooncake-integration/store/store_py.cpp | Removes trailing whitespace (unrelated formatting fix) |
```cpp
struct HotMemBlock {
    void* addr;    // Memory address
    size_t size;   // Block size in bytes
    bool in_use;   // Whether the block is currently in use
};
```
The HotMemBlock struct has an in_use field that is set in the code but never actually read or checked anywhere in the implementation. This field appears to be unused and should either be utilized in the logic or removed to reduce confusion and memory overhead.
Suggested change:

```diff
-    bool in_use;  // Whether the block is currently in use
+    [[maybe_unused]] bool in_use;  // Whether the block is currently in use
```
```cpp
cache_hits++;
mem_desc.buffer_descriptor.transport_endpoint_ = local_hostname_;
mem_desc.buffer_descriptor.buffer_address_ =
    reinterpret_cast<uintptr_t>(blk->addr);
if (mem_desc.buffer_descriptor.size_ != blk->size) {
    LOG(WARNING) << "Cache hit but size mismatch for key: " << key;
```
When a cache hit occurs but the size doesn't match, only a warning is logged but the cached data is still used. This could lead to data corruption if the cached block size is smaller than expected, as the transfer operation may read beyond the cached block's memory. Consider returning 0 (cache miss) when sizes don't match to force a proper remote fetch, or validate that blk->size is at least as large as the expected size before using the cached data.
Suggested change:

```cpp
// Validate that the cached block is large enough before using it.
if (blk->size >= mem_desc.buffer_descriptor.size_) {
    mem_desc.buffer_descriptor.transport_endpoint_ = local_hostname_;
    mem_desc.buffer_descriptor.buffer_address_ =
        reinterpret_cast<uintptr_t>(blk->addr);
    cache_hits++;
    if (blk->size != mem_desc.buffer_descriptor.size_) {
        LOG(WARNING) << "Cache hit with larger-than-expected size for key: "
                     << key << " (expected=" << mem_desc.buffer_descriptor.size_
                     << ", cached=" << blk->size << ")";
    }
} else {
    // Cached block is smaller than expected; treat as cache miss to avoid
    // potential out-of-bounds access when transferring data.
    LOG(WARNING) << "Cache hit but cached block is smaller than expected for key: "
                 << key << " (expected=" << mem_desc.buffer_descriptor.size_
                 << ", cached=" << blk->size << ")";
}
```
Force-pushed from 65d1a8b to 22a8688.
```cpp
err = client->InitLocalHotCache();
if (err != ErrorCode::OK) {
    LOG(ERROR) << "Failed to initialize local hot cache";
}
```
On the failure path, we should return std::nullopt to the user to signal that the current environment does not meet the configuration requirements.

This failure path is a fallback path; an error log tells the user that the configuration does not take effect. Should a wrong performance configuration block the initialization of Mooncake?

Yes. I think that if the Store cannot run while meeting the user's configuration, then initialization should fail directly. For users, a failed cluster startup is the most direct feedback; they will then modify their startup configuration or check their environment.
```cpp
Slice slice;
slice.ptr = task.data.data();
slice.size = task.size;
task.hot_cache->PutHotSlice(task.key, slice);
```
- The handler already holds a hot_cache_ pointer, so the task doesn't need to store another hot_cache pointer.
- If PutHotSlice() returns false, we should print an error log.
Force-pushed from 7f9b6df to 68c54f4.
Force-pushed from 68c54f4 to 025c4c2.
Force-pushed from bfbc7cc to a081245.
Force-pushed from a081245 to fd932da.
```cpp
if (mem_desc.buffer_descriptor.size_ != blk->size) {
    LOG(ERROR) << "Cache hit but size mismatch for key: " << key;
    return false;
}
```
Print both sizes in the log.
```cpp
// Only cache slices that came from TE transfer (non-local).
if (mem_desc.buffer_descriptor.transport_endpoint_ == local_hostname_) {
    return;
}
```
Wrap this condition in a function that decides whether SubmitPutTask is needed; each caller of ProcessSlicesAsync() should check this condition function.
```cpp
// Identify TE transfer slices (non-local) and submit async put tasks
for (size_t i = 0; i < slices.size(); ++i) {
    if (!hot_cache_handler_->SubmitPutTask(key, slices[i])) {
        LOG(ERROR) << "Failed to submit hot cache put task for key=" << key
                   << " slice_idx=" << i;
        return;
    }
```
By default, each key currently has only one slice, so we should not use a loop here. It would be better to add a defensive check verifying that the slice count is one.
```cpp
LOG(ERROR) << "Failed to submit hot cache put task for key=" << key
           << " slice_idx=" << i;
```
Just using a warning log is enough.
```cpp
    }
}

bool LocalHotCache::PutHotKey(HotMemBlock* block) {
```
This function handles two pieces of logic: releasing free blocks and enqueuing allocated blocks (enqueued at the end and at the beginning of lru_queue, respectively). For easier code maintenance, I think it is better to split lru_queue into two parts: a free_block_list that only allocates and releases memory, and an lru_queue similar to the current one.
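A minimal sketch of the suggested split, using toy names (`MemBlock`, `SimpleHotCache`) rather than the PR's actual classes: unpopulated blocks live in a free list, populated blocks in the LRU queue, and eviction moves a node back between the two lists via `splice`, with no allocation or data copy.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <string>
#include <unordered_map>

struct MemBlock {
    std::string key;
    size_t size = 0;
};

class SimpleHotCache {
public:
    explicit SimpleHotCache(size_t num_blocks) {
        for (size_t i = 0; i < num_blocks; ++i) free_list_.emplace_back();
    }

    // Insert a key: take a block from the free list, or evict the LRU victim.
    bool Put(const std::string& key, size_t size) {
        if (index_.count(key)) return true;  // already cached
        if (free_list_.empty()) {
            if (lru_.empty()) return false;  // no capacity at all
            // Evict the least-recently-used block (back of the queue).
            index_.erase(lru_.back().key);
            free_list_.splice(free_list_.end(), lru_, std::prev(lru_.end()));
        }
        // Move a free block to the front (most recently used) of the queue.
        lru_.splice(lru_.begin(), free_list_, free_list_.begin());
        lru_.front() = MemBlock{key, size};
        index_[key] = lru_.begin();
        return true;
    }

    // Look up a key and promote it to most-recently-used.
    const MemBlock* Get(const std::string& key) {
        auto it = index_.find(key);
        if (it == index_.end()) return nullptr;
        lru_.splice(lru_.begin(), lru_, it->second);  // touch
        return &*it->second;
    }

private:
    std::list<MemBlock> lru_;        // populated blocks, MRU at front
    std::list<MemBlock> free_list_;  // unpopulated blocks
    std::unordered_map<std::string, std::list<MemBlock>::iterator> index_;
};
```

`std::list::splice` preserves iterators into the moved node, so the `index_` map stays valid across touch and eviction without rehashing or reallocation.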
```cpp
 * @param block The block containing the data and key.
 * @return true if inserted successfully, false if race condition or error.
 */
bool PutHotKey(HotMemBlock* block);
```
Each function should use tl::expected<T, ErrorCode> as its return value.
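A sketch of the suggested shape. Since tl::expected is a third-party header, this example uses a minimal stand-in `Expected` type; real code would include `<tl/expected.hpp>` (or use `std::expected` in C++23) directly. The `ToyCache` class and its error codes are illustrative, not the PR's API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <variant>

enum class ErrorCode { INVALID_PARAMS, KEY_ALREADY_EXISTS, NO_FREE_BLOCK };

// Minimal stand-in for tl::expected<T, ErrorCode>.
template <typename T>
class Expected {
public:
    Expected(T value) : data_(std::move(value)) {}
    Expected(ErrorCode err) : data_(err) {}
    bool has_value() const { return std::holds_alternative<T>(data_); }
    const T& value() const { return std::get<T>(data_); }
    ErrorCode error() const { return std::get<ErrorCode>(data_); }
private:
    std::variant<T, ErrorCode> data_;
};

// Hypothetical reshaped PutHotKey: the caller learns *why* an insert failed
// instead of just seeing `false`. The cache internals here are a toy map.
class ToyCache {
public:
    Expected<bool> PutHotKey(const std::string& key, size_t size) {
        if (key.empty()) return ErrorCode::INVALID_PARAMS;
        if (entries_.count(key)) return ErrorCode::KEY_ALREADY_EXISTS;
        entries_[key] = size;
        return true;
    }
private:
    std::unordered_map<std::string, size_t> entries_;
};
```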
```cpp
    return key_to_lru_it_.find(key) != key_to_lru_it_.end();
}

HotMemBlock* LocalHotCache::GetHotKey(const std::string& key) {
```
It would be better to wrap the HotMemBlock in a HotCacheGuard. GetHotKey() should then return the guard; when the guard is destructed, it automatically calls ReleaseHotKey.
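A hedged sketch of this RAII idea: the guard pins a block on acquisition and releases it automatically in its destructor, so callers cannot forget ReleaseHotKey. `ToyCache` and its reference-count map are illustrative stand-ins for the PR's LocalHotCache.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

struct HotMemBlock {
    void* addr = nullptr;
    size_t size = 0;
};

class ToyCache {
public:
    HotMemBlock* GetHotKey(const std::string& key) {
        auto it = blocks_.find(key);
        if (it == blocks_.end()) return nullptr;
        ++refcount_[key];  // pin: block must not be evicted while in use
        return &it->second;
    }
    void ReleaseHotKey(const std::string& key) { --refcount_[key]; }
    int RefCount(const std::string& key) { return refcount_[key]; }
    void Insert(const std::string& key, HotMemBlock blk) { blocks_[key] = blk; }
private:
    std::unordered_map<std::string, HotMemBlock> blocks_;
    std::unordered_map<std::string, int> refcount_;
};

// RAII guard: acquires the block in the constructor, releases it on scope
// exit. Non-copyable so a pin cannot be double-released.
class HotCacheGuard {
public:
    HotCacheGuard(ToyCache& cache, const std::string& key)
        : cache_(cache), key_(key), block_(cache.GetHotKey(key)) {}
    ~HotCacheGuard() {
        if (block_) cache_.ReleaseHotKey(key_);
    }
    HotCacheGuard(const HotCacheGuard&) = delete;
    HotCacheGuard& operator=(const HotCacheGuard&) = delete;
    HotMemBlock* get() const { return block_; }
    explicit operator bool() const { return block_ != nullptr; }
private:
    ToyCache& cache_;
    std::string key_;
    HotMemBlock* block_;
};
```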
```cpp
if (victim_it == lru_queue_.end()) {
    return nullptr;
}
```
Print a warning log to explain that the queue is full.
```cpp
// Optimization: if key exists, just touch LRU to avoid data copy
if (hot_cache_->TouchHotKey(key)) {
    return true;
}
```
Just checking the existence of the key is enough; there is no need to touch the LRU list.
```cpp
if (task.hot_cache->PutHotKey(task.block)) {
    VLOG(2) << "Put task completed: " << task.key;
} else {
```
There is no need to store the hot_cache pointer in the task; the LocalHotCacheHandler already holds it.
A follow-up PR based on this basic PR should resolve the following tasks:
…ai#1226)

* feat(Store): add local hot cache for client
* feat(Store): add client local hot cache log to show performance
* fix: local hot cache initialize bug
* fix(Store): Mooncake put slice is max 16MB, so make local hot cache block 16MB
* feat(Store): move local hot cache initialization to Client::Create
* feat(Store): local hot cache remove unused small block implementation
* feat(Store): add client local hot cache unit test
* fix(Store): modify client local hot cache to suit v0.3.7
* feat(Store): change local hot cache unit test
* fix: initialize local hot cache with negative value
* feat: use in-process master and metadata for local hot cache unit test
* feat: update local hot cache to one-replica-one-slice version
* fix: local hot cache unit test use in-process master service
* fix: code style fix
* fix: fix dirty read when client wants to read a previously hit hot block but the hot block is modified by incoming put actions
* fix: local hot cache unit test use in-process master service
* fix: code format fix
* fix: fix comment problems for
* feat: add local hot asynchronous queue size limit
* fix: local hot cache task involves the block so that there is no memcpy operation when inserting local hot cache
* fix: code check fix
* fix: update block in_use prop to reference count

Co-authored-by: shichangzhang064 <zhangshichang@h-partners.com>
Description
This PR introduces the client hot cache feature mentioned in Issue 1062.
This feature is enabled by setting the environment variable LOCAL_HOT_CACHE_SIZE (in bytes). The Mooncake Store client will allocate memory blocks for caching hot data on a per-request basis. The block size is configurable via the environment variable LOCAL_HOT_BLOCK_SIZE (in bytes), with a default value of 16MB. Memory allocation follows the configured block size unit. If LOCAL_HOT_CACHE_SIZE is set to less than the block size (either the default 16MB or the value specified by LOCAL_HOT_BLOCK_SIZE), zero, negative, or another invalid input (such as a non-numeric string), the client hot cache feature will be disabled. For the standalone client scenario, the feature should stay disabled since all data is in local storage.
For architectures with multiple clients and a master, the feature is recommended. Users should adjust the LOCAL_HOT_CACHE_SIZE value based on their actual deployment context to strike an optimal balance, so that the performance gain from reduced cross-node data transfer outweighs the overhead of the additional local cache memory.
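The configuration rule described above can be sketched as a small parsing function. This is a sketch under stated assumptions: `ParseHotCacheSize` is an illustrative name (the PR's actual parsing lives in client_service.cpp), and a 16MB default block size is assumed per the description.

```cpp
#include <cassert>
#include <cstdlib>
#include <optional>
#include <string>

// Returns std::nullopt (feature disabled) for empty, negative, non-numeric,
// zero, or smaller-than-one-block values; otherwise the cache size in bytes.
std::optional<size_t> ParseHotCacheSize(const std::string& raw,
                                        size_t block_size = 16ull << 20) {
    if (raw.empty() || raw[0] == '-') return std::nullopt;  // empty/negative
    char* end = nullptr;
    unsigned long long v = std::strtoull(raw.c_str(), &end, 10);
    if (end == raw.c_str() || *end != '\0') return std::nullopt;  // non-numeric
    if (v < block_size) return std::nullopt;  // zero or below one block
    return static_cast<size_t>(v);
}
```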
Type of Change
How Has This Been Tested?
Run the unit tests for the feature: mooncake-store/tests/client_local_hot_cache_test.cpp.

Performance Test
Software
Hardware
Mooncake Performance Test
Environment
We run the test on a Kubernetes cluster that involves two Atlas 800I A2 machines.
Method
Start master-service pod as Mooncake Master and redis pod as the metadata server.
Start a Mooncake Client pod on node 1.
Call the MooncakeDistributedStore::Put interface to store data (e.g., 1GB) on Mooncake Client 1.

3.1 Store 10 keys with corresponding data (e.g., 1GB total) in Mooncake.
3.2 All data slices are stored on node 1 since only one client exists.
Start another Mooncake Client pod with local hot cache (e.g., 1GB) enabled on node 2.
Call the MooncakeDistributedStore::Get/BatchGet interface from Mooncake Client 2 to retrieve the stored data.

5.1 All data slices are on node 1, so all slices requested by Mooncake Client 2 will be transferred remotely.
5.2 Use TCP protocol to increase network transmission latency.
5.3 After retrieving data from remote node 1, the slices are stored in the local hot cache.
Record the latency of the Get/BatchGet calls.
Repeatedly call the MooncakeDistributedStore::Get/BatchGet interface from Mooncake Client 2 and record the latency.

7.1 These requests result in cache hits.
Result
Note: The latency is measured per key request; for the BatchGet interface, the first-round latency is the average latency of the first 2 keys.

End to End Inference Performance Test
Environment
We run the test on a Kubernetes cluster that involves three Atlas 800I A2 machines. We deployed the inference service with vLLM v0 and Mooncake.
Method
Start master-service pod as Mooncake Master and redis pod as the metadata server.
Start 5 prefill pods and 5 decode pods running vLLM and Mooncake Client. Each client offers 30GB store memory.
2.1 Deploy 10 Mooncake Clients to increase the probability of remote slice data transmission.
2.2 Use TCP protocol and deploy across three machines to increase network transmission latency.
2.3 Prefill instances do not enable the local hot cache feature, while decode instances enable it.
Start proxy server pod for routing inference requests
3.1 Use the example proxy server from vLLM.
Start EvalScope script
4.1 Send 1000 inference requests at concurrency 5 or 10, with a fraction of exact duplicate requests to simulate hotspot/cache hits. Average input length is 4096 tokens and output length is fixed at 256 tokens.
4.2 Use a random dataset to vary prefix lengths, controlling prefix sharing across requests and thus the number of hot slices (and hit rate).
4.3 EvalScope summarizes and logs key performance metrics such as throughput, latency (p50/p95/p99), and success/error rates.
Test Script
Note: This is the start command of the prefill pod, so POD_IP is the node address of this pod.

Note: This is the start command of the decode pod, so POD_IP is the node address of this pod.

Result
Checklist