I am trying to add a proprietary model (a fairly standard LLM with an HF model.py) to vLLM. I finished the porting script, but the generation results do not match the HF ones and go terribly wrong, so I started comparing the outputs of the vLLM model and the HF model side by side.
I noticed that everything is correct except the call to the flash_attn_with_kvcache function. This is very strange, because the result of the flash_attn_varlen_func call does match the HF output. Say you have a prompt ABC and the HF model completes it with DEF. If you run the prompt through the vLLM model, the first generated token is D, but it is followed by random tokens. If you change the prompt to ABCD, vLLM generates E correctly, again followed by random tokens.
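For reference, this is roughly the comparison harness I am using (the model path, prompt, and token count are placeholders). Both sides decode greedily, so the outputs should be directly comparable; in practice I run the two halves in separate processes to avoid fighting over GPU memory:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "/path/to/my_model"  # placeholder: the proprietary checkpoint
PROMPT = "ABC"               # placeholder prompt

# HF reference: greedy decoding.
tok = AutoTokenizer.from_pretrained(MODEL)
hf_model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16).cuda()
hf_ids = hf_model.generate(
    **tok(PROMPT, return_tensors="pt").to("cuda"),
    do_sample=False, max_new_tokens=16,
)
print("HF  :", tok.decode(hf_ids[0], skip_special_tokens=True))

# vLLM: greedy decoding (temperature=0).
llm = LLM(model=MODEL)
outs = llm.generate([PROMPT], SamplingParams(temperature=0, max_tokens=16))
print("vLLM:", outs[0].outputs[0].text)
```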
The q, k, and v passed into FlashAttentionImpl.forward() all match the HF values, so it looks like something is wrong with the KV cache stored by paged attention.
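For completeness, this is roughly how I verify that match: I dump the tensors from both runs and diff them offline. The file naming and the step counter below are ad-hoc debug helpers I added, not vLLM APIs:

```python
import torch

def debug_dump(name: str, tensor: torch.Tensor, step: int) -> None:
    """Save a tensor so it can be diffed against the corresponding HF tensor later."""
    torch.save(tensor.detach().cpu(), f"/tmp/vllm_{name}_step{step}.pt")

# Called inside the attention forward, e.g.:
#   debug_dump("q", query, step)
#   debug_dump("k", key, step)
#   debug_dump("v", value, step)
# Then offline:
#   torch.allclose(vllm_q, hf_q.reshape(vllm_q.shape), atol=1e-3)
```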
Any ideas on how to debug this? Currently I am having difficulty slicing the correct indices out of the paged-attention data (`kv_cache`). I know I can use `block_tables` to select the block, but I am not sure how to select the relevant data inside a block. And if I do find that the `kv_cache` values don't match HF's `past_key_values`, what would the possible causes be?
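This is what I have tried for reconstructing the per-token K/V of one sequence from the paged cache. I am assuming the flash-attn backend layout where `kv_cache` has shape `(2, num_blocks, block_size, num_kv_heads, head_size)`; please correct me if the layout or the indexing is wrong, since that may be exactly where my confusion is:

```python
import torch

def gather_kv_for_seq(kv_cache: torch.Tensor,
                      block_table: torch.Tensor,
                      seq_len: int,
                      block_size: int):
    """Reassemble contiguous K/V for one sequence from the paged cache.

    Assumes the flash-attn backend layout:
        kv_cache: (2, num_blocks, block_size, num_kv_heads, head_size)
    so that token position `pos` lives at
        block_number = block_table[pos // block_size]
        block_offset = pos % block_size
    (equivalently, slot_mapping[pos] == block_number * block_size + block_offset).
    Other backends pack the cache differently, so this may not apply as-is.
    """
    key_cache, value_cache = kv_cache[0], kv_cache[1]
    keys, values = [], []
    for pos in range(seq_len):
        block_number = int(block_table[pos // block_size])
        block_offset = pos % block_size
        keys.append(key_cache[block_number, block_offset])      # (num_kv_heads, head_size)
        values.append(value_cache[block_number, block_offset])
    # Returns (seq_len, num_kv_heads, head_size); HF past_key_values[layer][0] is
    # (batch, num_kv_heads, seq_len, head_size), so transpose before comparing.
    return torch.stack(keys), torch.stack(values)
```

If that indexing is right, I would expect `gather_kv_for_seq(kv_cache, block_tables[seq_idx], seq_len, block_size)[0].transpose(0, 1)` to match `past_key_values[layer_idx][0][0]` from the HF run up to numerical tolerance, but I am not sure the layout assumption holds for my backend/version.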
Any help would be much appreciated.