Conversation

sakunkun (Contributor) commented Oct 21, 2025

Purpose

To trigger both LMCache save and load operations in this example.
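
For reference, a minimal sketch of the example's flow after this change (the model name and prompts below are placeholders, and the LMCache connector setup is elided, so this is not the exact code from the example):

import time

from vllm import LLM, SamplingParams

# Placeholder setup; the real example also wires up the LMCache connector
# so that KV blocks are offloaded to CPU memory through LMCache.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
sampling_params = SamplingParams(temperature=0.0, max_tokens=10)
long_prefix = "Hello, my name is " * 1000  # hypothetical long shared prefix

# Request 1: LMCache misses, so vLLM computes the KV cache and LMCache stores it.
llm.generate([long_prefix + "Tell me a story"], sampling_params)

time.sleep(1)

# Without this call, request 2 is served entirely from vLLM's own prefix
# cache and LMCache's load path is never exercised.
llm.reset_prefix_cache()

# Request 2: vLLM's prefix cache is empty, so the shared prefix is loaded
# back from LMCache (the "Retrieved ..." line in the logs below).
llm.generate([long_prefix + "Tell me another story"], sampling_params)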

Test Plan

Test Result

before
The second request only registers a hit in LMCache but loads nothing from it, because the tokens are still served from vLLM's local prefix cache (note "need to load: -112" and the absence of a "Retrieved ..." line in the log below).

INFO 10-20 21:45:45 [llm.py:306] Supported_tasks: ['generate']
Adding requests: 100%|█████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 45.45it/s]
Processed prompts:   0%|               | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s](EngineCore_DP0 pid=16174) [2025-10-20 21:45:45,601] LMCache INFO: Reqid: 0, Total tokens 6005, LMCache hit tokens: 0, need to load: 0 (vllm_v1_adapter.py:1191:lmcache.integration.vllm.vllm_v1_adapter)
(EngineCore_DP0 pid=16174) [2025-10-20 21:45:45,609] LMCache INFO: Post-initializing LMCacheEngine (cache_engine.py:170:lmcache.v1.cache_engine)
(EngineCore_DP0 pid=16174) [2025-10-20 21:45:45,728] LMCache INFO: Storing KV cache for 6005 out of 6005 tokens (skip_leading_tokens=0) for request 0 (vllm_v1_adapter.py:1075:lmcache.integration.vllm.vllm_v1_adapter)
(EngineCore_DP0 pid=16174) [2025-10-20 21:45:45,786] LMCache INFO: Stored 6005 out of total 6005 tokens. size: 0.6414 gb, cost 57.6882 ms, throughput: 11.1184 GB/s; offload_time: 57.5493 ms, put_time: 0.1390 ms (cache_engine.py:288:lmcache.v1.cache_engine)
Processed prompts: 100%|██| 1/1 [00:00<00:00,  3.87it/s, est. speed input: 23251.10 toks/s, output: 38.72 toks/s]
--------------------------------------------------
Generated text: '...Hello, my name is...Hello, my'
Generation took 0.28 seconds, first request done.
--------------------------------------------------
Adding requests: 100%|█████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 66.55it/s]
Processed prompts:   0%|               | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s](EngineCore_DP0 pid=16174) [2025-10-20 21:45:46,877] LMCache INFO: Reqid: 1, Total tokens 6006, LMCache hit tokens: 5888, need to load: -112 (vllm_v1_adapter.py:1191:lmcache.integration.vllm.vllm_v1_adapter)
Processed prompts: 100%|| 1/1 [00:00<00:00, 12.79it/s, est. speed input: 77081.69 toks/s, output: 128.31 toks/s]
--------------------------------------------------
Generated text: ' about a person who is a doctor. I need'
Generation took 0.09 seconds, second request done.

after
After clearing the prefix cache left over from the first request, the logs show that the second request both hits and loads the KV cache from LMCache ("need to load: 5888", followed by the "Retrieved 5888 out of 5888 required tokens" line).

INFO 10-20 21:42:35 [llm.py:306] Supported_tasks: ['generate']
Adding requests: 100%|█████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 45.59it/s]
Processed prompts:   0%|               | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s](EngineCore_DP0 pid=15630) [2025-10-20 21:42:35,418] LMCache INFO: Reqid: 0, Total tokens 6005, LMCache hit tokens: 0, need to load: 0 (vllm_v1_adapter.py:1191:lmcache.integration.vllm.vllm_v1_adapter)
(EngineCore_DP0 pid=15630) [2025-10-20 21:42:35,427] LMCache INFO: Post-initializing LMCacheEngine (cache_engine.py:170:lmcache.v1.cache_engine)
(EngineCore_DP0 pid=15630) [2025-10-20 21:42:35,537] LMCache INFO: Storing KV cache for 6005 out of 6005 tokens (skip_leading_tokens=0) for request 0 (vllm_v1_adapter.py:1075:lmcache.integration.vllm.vllm_v1_adapter)
(EngineCore_DP0 pid=15630) [2025-10-20 21:42:35,592] LMCache INFO: Stored 6005 out of total 6005 tokens. size: 0.6414 gb, cost 55.4192 ms, throughput: 11.5737 GB/s; offload_time: 55.3302 ms, put_time: 0.0890 ms (cache_engine.py:288:lmcache.v1.cache_engine)
Processed prompts: 100%|██| 1/1 [00:00<00:00,  4.20it/s, est. speed input: 25241.92 toks/s, output: 42.03 toks/s]
--------------------------------------------------
Generated text: '...Hello, my name is...Hello, my'
Generation took 0.26 seconds, first request done.
--------------------------------------------------
(EngineCore_DP0 pid=15630) INFO 10-20 21:42:36 [block_pool.py:378] Successfully reset prefix cache
Adding requests: 100%|█████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 69.05it/s]
Processed prompts:   0%|               | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s](EngineCore_DP0 pid=15630) [2025-10-20 21:42:36,674] LMCache INFO: Reqid: 1, Total tokens 6006, LMCache hit tokens: 5888, need to load: 5888 (vllm_v1_adapter.py:1191:lmcache.integration.vllm.vllm_v1_adapter)
(EngineCore_DP0 pid=15630) [2025-10-20 21:42:36,732] LMCache INFO: Retrieved 5888 out of 5888 required tokens (from 5888 total tokens). size: 0.6289 gb, cost 55.2204 ms, throughput: 11.3890 GB/s; (cache_engine.py:509:lmcache.v1.cache_engine)
Processed prompts: 100%|██| 1/1 [00:00<00:00,  7.64it/s, est. speed input: 46018.34 toks/s, output: 76.61 toks/s]
--------------------------------------------------
Generated text: ' about a person who is a doctor. I need'
Generation took 0.15 seconds, second request done.


mergify bot commented Oct 21, 2025

Documentation preview: https://vllm--27248.org.readthedocs.build/en/27248/

mergify bot added the documentation and kv-connector labels Oct 21, 2025

gemini-code-assist bot left a comment

Code Review

This pull request correctly adds a call to reset_prefix_cache in the LMCache CPU offload example to ensure both save and load operations are demonstrated. This is a good improvement for the example's clarity. I've added one suggestion to make the example more robust by checking the return value of the cache reset operation.

time.sleep(1)
# Clear vLLM's internal prefix cache to force the second request
# to fetch cached KVs from LMCache
llm.reset_prefix_cache()
high

The reset_prefix_cache method can fail if there are still blocks in use, and it returns a boolean indicating success or failure. The example should check this return value and raise an error if it's False. This will make the example more robust and prevent confusion if the cache reset fails, which would cause the example to not demonstrate the intended LMCache loading behavior.

Suggested change
-llm.reset_prefix_cache()
+if not llm.reset_prefix_cache():
+    raise RuntimeError(
+        "Failed to reset prefix cache. The example may not run as expected."
+    )

Signed-off-by: zhou.qianjun <zhou.qianjun@zte.com.cn>
@sakunkun sakunkun marked this pull request as draft October 23, 2025 06:32
@sakunkun sakunkun marked this pull request as ready for review October 23, 2025 06:35