
Conversation

@mengniwang95
Contributor

Type of Change

workaround

Description

Update PatchedVLLMKVCache for deepseek performance

xuechendi pushed a commit to HabanaAI/vllm-fork that referenced this pull request Apr 4, 2025
Previously, when we used INC to convert the deepseek FP8 model, we needed this
[commit](intel/neural-compressor@7c0a3e2)
to remove extra converts in KVCache, even though GC should in theory be able
to remove them during graph optimization.
Furthermore, the change in that commit does not align with the design of INC
patched modules, which keep the returned tensor in BF16 because we cannot
know the user's next operation.
So I updated the modeling file so that GC can optimize the patched KVCache
pattern of the deepseek model.
Since the next release is very close and GC currently does not work as
expected during the decode stage, this is still a workaround. We will
root-cause and fix it at the source in the next release.
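
For context, here is a minimal sketch of the patched KVCache pattern described above. The class name, forward signature, and per-tensor `scale` are assumptions for illustration, not INC's actual API: the point is that the cache stores FP8 values while the module returns BF16, so any back-to-back dequant/quant pair around it is redundant and GC can fold it away during graph optimization.

```python
import torch

class PatchedVLLMKVCacheSketch(torch.nn.Module):
    """Illustrative stand-in for INC's PatchedVLLMKVCache.

    The name, signature, and scale handling here are assumptions
    made for this sketch; the real patched module differs.
    """

    def __init__(self, scale: torch.Tensor):
        super().__init__()
        self.scale = scale  # assumed per-tensor quantization scale

    def forward(self, value, cache, slot_mapping):
        # Quantize incoming BF16 values to FP8 for cache storage.
        fp8_value = (value.float() / self.scale).to(torch.float8_e4m3fn)
        cache.index_copy_(0, slot_mapping, fp8_value)
        # Return BF16, because we can't know the caller's next op.
        # If the caller immediately converts back to FP8, the
        # dequant/quant pair cancels out and GC should eliminate it.
        return (cache.float() * self.scale).to(torch.bfloat16)
```

With this shape, the extra converts that the earlier commit deleted by hand become a pattern the graph compiler can recognize and remove on its own.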

This PR should be used together with intel/neural-compressor#2165.

Signed-off-by: Mengni Wang <mengni.wang@intel.com>
yiliu30 merged commit fcf3031 into r1-woq on Apr 5, 2025
7 of 9 checks passed
yiliu30 deleted the dev/mengni/kv branch on April 5, 2025 at 07:55