I am trying to add a proprietary model (a fairly standard LLM with an HF model.py) to vLLM. I finished the porting script, but the generation results do not match the HF ones and go terribly wrong, so I started comparing the outputs of the vLLM model and the HF model side by side.
I noticed that everything is correct except the call to the flash_attn_with_kvcache function. This is very strange, because the result of the flash_attn_varlen_func call does match the HF output. Say you have a prompt ABC and the HF model completes it with DEF. If you run the prompt through the vLLM model, the first generated token is D, but it is followed by random tokens. If you change the prompt to ABCD, vLLM generates E correctly, again followed by random tokens.
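For reference, this is roughly the comparison harness I am using (the model path, prompt, and token count are placeholders). Both sides decode greedily, so the outputs should be directly comparable; in practice I run the two halves in separate processes to avoid fighting over GPU memory:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "/path/to/my_model"  # placeholder: the proprietary checkpoint
PROMPT = "ABC"               # placeholder prompt

# HF reference: greedy decoding.
tok = AutoTokenizer.from_pretrained(MODEL)
hf_model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16).cuda()
hf_ids = hf_model.generate(
    **tok(PROMPT, return_tensors="pt").to("cuda"),
    do_sample=False, max_new_tokens=16,
)
print("HF  :", tok.decode(hf_ids[0], skip_special_tokens=True))

# vLLM: greedy decoding (temperature=0).
llm = LLM(model=MODEL)
outs = llm.generate([PROMPT], SamplingParams(temperature=0, max_tokens=16))
print("vLLM:", outs[0].outputs[0].text)
```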
The q, k, and v passed into FlashAttentionImpl.forward() all match the HF values, so it looks like something is wrong with the KV cache stored by paged attention.
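For completeness, this is roughly how I verify that match: I dump the tensors from both runs and diff them offline. The file naming and the step counter below are ad-hoc debug helpers I added, not vLLM APIs:

```python
import torch

def debug_dump(name: str, tensor: torch.Tensor, step: int) -> None:
    """Save a tensor so it can be diffed against the corresponding HF tensor later."""
    torch.save(tensor.detach().cpu(), f"/tmp/vllm_{name}_step{step}.pt")

# Called inside the attention forward, e.g.:
#   debug_dump("q", query, step)
#   debug_dump("k", key, step)
#   debug_dump("v", value, step)
# Then offline:
#   torch.allclose(vllm_q, hf_q.reshape(vllm_q.shape), atol=1e-3)
```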
Any ideas on how to debug this? Currently I am having difficulty slicing the correct indices out of the paged-attention data (`kv_cache`). I know I can use `block_tables` to select the block, but I am not sure how to select the relevant data inside a block. And if I do find that the `kv_cache` values don't match HF's `past_key_values`, what would the possible causes be?
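This is what I have tried for reconstructing the per-token K/V of one sequence from the paged cache. I am assuming the flash-attn backend layout where `kv_cache` has shape `(2, num_blocks, block_size, num_kv_heads, head_size)`; please correct me if the layout or the indexing is wrong, since that may be exactly where my confusion is:

```python
import torch

def gather_kv_for_seq(kv_cache: torch.Tensor,
                      block_table: torch.Tensor,
                      seq_len: int,
                      block_size: int):
    """Reassemble contiguous K/V for one sequence from the paged cache.

    Assumes the flash-attn backend layout:
        kv_cache: (2, num_blocks, block_size, num_kv_heads, head_size)
    so that token position `pos` lives at
        block_number = block_table[pos // block_size]
        block_offset = pos % block_size
    (equivalently, slot_mapping[pos] == block_number * block_size + block_offset).
    Other backends pack the cache differently, so this may not apply as-is.
    """
    key_cache, value_cache = kv_cache[0], kv_cache[1]
    keys, values = [], []
    for pos in range(seq_len):
        block_number = int(block_table[pos // block_size])
        block_offset = pos % block_size
        keys.append(key_cache[block_number, block_offset])      # (num_kv_heads, head_size)
        values.append(value_cache[block_number, block_offset])
    # Returns (seq_len, num_kv_heads, head_size); HF past_key_values[layer][0] is
    # (batch, num_kv_heads, seq_len, head_size), so transpose before comparing.
    return torch.stack(keys), torch.stack(values)
```

If that indexing is right, I would expect `gather_kv_for_seq(kv_cache, block_tables[seq_idx], seq_len, block_size)[0].transpose(0, 1)` to match `past_key_values[layer_idx][0][0]` from the HF run up to numerical tolerance, but I am not sure the layout assumption holds for my backend/version.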
Any help would be much appreciated.