I have a question arising from reading the code. I notice that in ~/lightllm/models/llama2/layer_infer/transformer_layer_infer.py, flash attention is only applied in the prefill stage (i.e. context_attention_fwd), but not in the decoding stage (i.e. token_att_fwd). Am I correct in this understanding?
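To make sure I am reading it right, here is a minimal pure-PyTorch sketch of the two stages as I understand them; it does not call the actual Triton kernels, and the tensor shapes and index-table layout are assumptions on my part:

```python
import torch

def prefill_attention(q, k, v):
    # Prefill: q, k, v cover the whole prompt, shape (seq_len, n_heads, head_dim),
    # so a fused flash-attention style kernel can process the full sequence
    # (replaced here by plain PyTorch for illustration).
    scale = q.shape[-1] ** -0.5
    scores = torch.einsum("qhd,khd->hqk", q, k) * scale
    mask = torch.triu(torch.full(scores.shape[-2:], float("-inf")), diagonal=1)
    probs = torch.softmax(scores + mask, dim=-1)
    return torch.einsum("hqk,khd->qhd", probs, v)

def decode_attention(q, k_pool, v_pool, token_index):
    # Decode: q is a single new token, shape (n_heads, head_dim); its K/V history
    # lives at arbitrary rows of a token-granularity cache pool and is addressed
    # through an index table rather than a contiguous slice.
    k = k_pool[token_index]            # (seq_len, n_heads, head_dim)
    v = v_pool[token_index]
    scale = q.shape[-1] ** -0.5
    scores = torch.einsum("hd,khd->hk", q, k) * scale
    probs = torch.softmax(scores, dim=-1)
    return torch.einsum("hk,khd->hd", probs, v)
```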
In principle, token attention doesn't conflict with flash attention. Do you plan to combine the two in the decoding stage as well?
Also, what is the obstacle to directly using the flash-attention repo together with the token-level memory management?
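My guess at the obstacle, sketched below just to make the question concrete: a kernel written for a dense (batch, seq_len, n_heads, head_dim) KV layout would need the scattered cache rows gathered into a contiguous buffer on every decoding step. The pool and index-table shapes here are illustrative assumptions, not lightllm's actual data structures:

```python
import torch

# Hypothetical token-granularity KV pool plus a per-request index table,
# as I understand token-level memory management.
n_pool_slots, n_heads, head_dim = 4096, 8, 64
k_pool = torch.randn(n_pool_slots, n_heads, head_dim)
v_pool = torch.randn(n_pool_slots, n_heads, head_dim)

batch, max_seq_len = 4, 128
# Each request's tokens can live at arbitrary pool slots, not a contiguous range.
req_to_token = torch.randint(0, n_pool_slots, (batch, max_seq_len))

# Calling a kernel that expects a dense (batch, seq_len, n_heads, head_dim) layout
# would require gathering the scattered rows first, costing extra memory traffic
# and extra GPU memory at every decoding step.
k_dense = k_pool[req_to_token]   # (batch, max_seq_len, n_heads, head_dim)
v_dense = v_pool[req_to_token]
```

Is this gather/copy the main blocker, or is there something else I am missing?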