Conversation

sakunkun (Contributor) commented Oct 21, 2025

Purpose

To trigger both LMCache save and load operations in this example.
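
For reference, a minimal sketch of the example's flow after this change (the model name and prompts below are placeholders, and the LMCache connector setup is elided, so this is not the exact code from the example):

import time

from vllm import LLM, SamplingParams

# Placeholder setup; the real example also wires up the LMCache connector
# so that KV blocks are offloaded to CPU memory through LMCache.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
sampling_params = SamplingParams(temperature=0.0, max_tokens=10)
long_prefix = "Hello, my name is " * 1000  # hypothetical long shared prefix

# Request 1: LMCache misses, so vLLM computes the KV cache and LMCache stores it.
llm.generate([long_prefix + "Tell me a story"], sampling_params)

time.sleep(1)

# Without this call, request 2 is served entirely from vLLM's own prefix
# cache and LMCache's load path is never exercised.
llm.reset_prefix_cache()

# Request 2: vLLM's prefix cache is empty, so the shared prefix is loaded
# back from LMCache (the "Retrieved ..." line in the logs below).
llm.generate([long_prefix + "Tell me another story"], sampling_params)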

Test Plan

Test Result

before
The second request only registers a hit in LMCache but loads nothing from it, because the tokens are still served from vLLM's local prefix cache (note "need to load: -112" and the absence of a "Retrieved ..." line in the log below).

INFO 10-20 21:45:45 [llm.py:306] Supported_tasks: ['generate']
Adding requests: 100%|█████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 45.45it/s]
Processed prompts:   0%|               | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s](EngineCore_DP0 pid=16174) [2025-10-20 21:45:45,601] LMCache INFO: Reqid: 0, Total tokens 6005, LMCache hit tokens: 0, need to load: 0 (vllm_v1_adapter.py:1191:lmcache.integration.vllm.vllm_v1_adapter)
(EngineCore_DP0 pid=16174) [2025-10-20 21:45:45,609] LMCache INFO: Post-initializing LMCacheEngine (cache_engine.py:170:lmcache.v1.cache_engine)
(EngineCore_DP0 pid=16174) [2025-10-20 21:45:45,728] LMCache INFO: Storing KV cache for 6005 out of 6005 tokens (skip_leading_tokens=0) for request 0 (vllm_v1_adapter.py:1075:lmcache.integration.vllm.vllm_v1_adapter)
(EngineCore_DP0 pid=16174) [2025-10-20 21:45:45,786] LMCache INFO: Stored 6005 out of total 6005 tokens. size: 0.6414 gb, cost 57.6882 ms, throughput: 11.1184 GB/s; offload_time: 57.5493 ms, put_time: 0.1390 ms (cache_engine.py:288:lmcache.v1.cache_engine)
Processed prompts: 100%|██| 1/1 [00:00<00:00,  3.87it/s, est. speed input: 23251.10 toks/s, output: 38.72 toks/s]
--------------------------------------------------
Generated text: '...Hello, my name is...Hello, my'
Generation took 0.28 seconds, first request done.
--------------------------------------------------
Adding requests: 100%|█████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 66.55it/s]
Processed prompts:   0%|               | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s](EngineCore_DP0 pid=16174) [2025-10-20 21:45:46,877] LMCache INFO: Reqid: 1, Total tokens 6006, LMCache hit tokens: 5888, need to load: -112 (vllm_v1_adapter.py:1191:lmcache.integration.vllm.vllm_v1_adapter)
Processed prompts: 100%|| 1/1 [00:00<00:00, 12.79it/s, est. speed input: 77081.69 toks/s, output: 128.31 toks/s]
--------------------------------------------------
Generated text: ' about a person who is a doctor. I need'
Generation took 0.09 seconds, second request done.

after
After clearing the prefix cache left over from the first request, the logs show that the second request both hits and loads the KV cache from LMCache ("need to load: 5888", followed by the "Retrieved 5888 out of 5888 required tokens" line).

INFO 10-20 21:42:35 [llm.py:306] Supported_tasks: ['generate']
Adding requests: 100%|█████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 45.59it/s]
Processed prompts:   0%|               | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s](EngineCore_DP0 pid=15630) [2025-10-20 21:42:35,418] LMCache INFO: Reqid: 0, Total tokens 6005, LMCache hit tokens: 0, need to load: 0 (vllm_v1_adapter.py:1191:lmcache.integration.vllm.vllm_v1_adapter)
(EngineCore_DP0 pid=15630) [2025-10-20 21:42:35,427] LMCache INFO: Post-initializing LMCacheEngine (cache_engine.py:170:lmcache.v1.cache_engine)
(EngineCore_DP0 pid=15630) [2025-10-20 21:42:35,537] LMCache INFO: Storing KV cache for 6005 out of 6005 tokens (skip_leading_tokens=0) for request 0 (vllm_v1_adapter.py:1075:lmcache.integration.vllm.vllm_v1_adapter)
(EngineCore_DP0 pid=15630) [2025-10-20 21:42:35,592] LMCache INFO: Stored 6005 out of total 6005 tokens. size: 0.6414 gb, cost 55.4192 ms, throughput: 11.5737 GB/s; offload_time: 55.3302 ms, put_time: 0.0890 ms (cache_engine.py:288:lmcache.v1.cache_engine)
Processed prompts: 100%|██| 1/1 [00:00<00:00,  4.20it/s, est. speed input: 25241.92 toks/s, output: 42.03 toks/s]
--------------------------------------------------
Generated text: '...Hello, my name is...Hello, my'
Generation took 0.26 seconds, first request done.
--------------------------------------------------
(EngineCore_DP0 pid=15630) INFO 10-20 21:42:36 [block_pool.py:378] Successfully reset prefix cache
Adding requests: 100%|█████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 69.05it/s]
Processed prompts:   0%|               | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s](EngineCore_DP0 pid=15630) [2025-10-20 21:42:36,674] LMCache INFO: Reqid: 1, Total tokens 6006, LMCache hit tokens: 5888, need to load: 5888 (vllm_v1_adapter.py:1191:lmcache.integration.vllm.vllm_v1_adapter)
(EngineCore_DP0 pid=15630) [2025-10-20 21:42:36,732] LMCache INFO: Retrieved 5888 out of 5888 required tokens (from 5888 total tokens). size: 0.6289 gb, cost 55.2204 ms, throughput: 11.3890 GB/s; (cache_engine.py:509:lmcache.v1.cache_engine)
Processed prompts: 100%|██| 1/1 [00:00<00:00,  7.64it/s, est. speed input: 46018.34 toks/s, output: 76.61 toks/s]
--------------------------------------------------
Generated text: ' about a person who is a doctor. I need'
Generation took 0.15 seconds, second request done.


mergify bot commented Oct 21, 2025

Documentation preview: https://vllm--27248.org.readthedocs.build/en/27248/

mergify bot added the documentation and kv-connector labels Oct 21, 2025

gemini-code-assist bot left a comment

Code Review

This pull request correctly adds a call to reset_prefix_cache in the LMCache CPU offload example to ensure both save and load operations are demonstrated. This is a good improvement for the example's clarity. I've added one suggestion to make the example more robust by checking the return value of the cache reset operation.

time.sleep(1)
# Clear vLLM's internal prefix cache to force the second request
# to fetch cached KVs from LMCache
llm.reset_prefix_cache()
high

The reset_prefix_cache method can fail if there are still blocks in use, and it returns a boolean indicating success or failure. The example should check this return value and raise an error if it's False. This will make the example more robust and prevent confusion if the cache reset fails, which would cause the example to not demonstrate the intended LMCache loading behavior.

Suggested change
-llm.reset_prefix_cache()
+if not llm.reset_prefix_cache():
+    raise RuntimeError(
+        "Failed to reset prefix cache. The example may not run as expected."
+    )

Signed-off-by: zhou.qianjun <zhou.qianjun@zte.com.cn>
@sakunkun sakunkun marked this pull request as draft October 23, 2025 06:32
@sakunkun sakunkun marked this pull request as ready for review October 23, 2025 06:35