[Misc]: a question about chunked-prefill in flash-attn backends #4863
Closed as not planned
Description
Anything you want to discuss about vllm.
Looking at vllm/vllm/attention/backends/flash_attn.py (line 282 in 99caa49), I noticed that in the flash-attn backend, forward_prefix and forward_decode seem to be executed serially. Does forward_decode wait for forward_prefix to finish before running? If so, can this still take advantage of the performance benefit of chunked prefill, i.e. having the prefill tokens and the decode tokens in the same batch?
if prefill_meta := attn_metadata.prefill_metadata:
    output[:num_prefill_tokens] = PagedAttention.forward_prefix(...)
if decode_meta := attn_metadata.decode_metadata:
    output[num_prefill_tokens:] = PagedAttention.forward_decode(...)
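To make the question concrete, here is a minimal, self-contained sketch of the pattern above. It is not the real vLLM code: `fake_prefix_kernel`, `fake_decode_kernel`, and the tensor shapes are made up for illustration. The point it shows is that both kernels are launched one after another on the same (default) CUDA stream, so they execute in launch order on the GPU, with each one writing only its own slice of the shared `output` tensor.

```python
import torch

# Toy stand-ins for PagedAttention.forward_prefix / forward_decode.
# These names and shapes are hypothetical; the real kernels take
# query/key/value tensors plus block tables and attention metadata.
def fake_prefix_kernel(q: torch.Tensor) -> torch.Tensor:
    return q * 2.0  # pretend attention over the prefill tokens

def fake_decode_kernel(q: torch.Tensor) -> torch.Tensor:
    return q + 1.0  # pretend attention over the decode tokens

num_prefill_tokens = 96   # e.g. one chunked-prefill slice
num_decode_tokens = 32    # e.g. 32 sequences each decoding one token
hidden_size = 128

device = "cuda" if torch.cuda.is_available() else "cpu"
query = torch.randn(num_prefill_tokens + num_decode_tokens, hidden_size, device=device)
output = torch.empty_like(query)

# Same pattern as the snippet above: two kernel launches on the same
# stream. Launches are asynchronous from the CPU's point of view, but
# the stream runs them in launch order, so the decode kernel starts
# only after the prefix kernel has finished on the GPU.
output[:num_prefill_tokens] = fake_prefix_kernel(query[:num_prefill_tokens])
output[num_prefill_tokens:] = fake_decode_kernel(query[num_prefill_tokens:])

# Optional: CUDA events make the back-to-back ordering visible.
if device == "cuda":
    start, mid, end = (torch.cuda.Event(enable_timing=True) for _ in range(3))
    start.record()
    output[:num_prefill_tokens] = fake_prefix_kernel(query[:num_prefill_tokens])
    mid.record()
    output[num_prefill_tokens:] = fake_decode_kernel(query[num_prefill_tokens:])
    end.record()
    torch.cuda.synchronize()
    print(f"prefix: {start.elapsed_time(mid):.3f} ms, decode: {mid.elapsed_time(end):.3f} ms")
```

In other words, as I understand it, the two kernels do run back to back, but the whole batch (prefill chunk plus decode tokens) still goes through a single model forward pass, which is where the scheduling gains of chunked prefill would come from rather than from overlapping the two kernels.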