
[Question] Flash attention only applies to prefilling stage #147

Open
@KexinFeng

Description

I have a question arising from reading the code. I notice that in ~/lightllm/models/llama2/layer_infer/transformer_layer_infer.py, flash attention is only applied in the prefill stage, i.e. context_attention_fwd, but not in the decoding stage, i.e. token_att_fwd. Is my understanding correct?
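
For reference, here is a minimal plain-PyTorch sketch (not the lightllm code itself) of the dispatch pattern I am describing; `prefill_attention` and `decode_attention` are illustrative stand-ins for the Triton kernels `context_attention_fwd` and `token_att_fwd`:

```python
import torch
import torch.nn.functional as F

def prefill_attention(q, k, v):
    # Prefill: the whole prompt is available, so a fused flash-style kernel
    # (context_attention_fwd in lightllm) can tile over the full sequence.
    # q, k, v: [batch, heads, prompt_len, head_dim]
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

def decode_attention(q, k_cache, v_cache, slot_ids):
    # Decode: one query token per step, with K/V addressed through a
    # token-level slot table (token_att_fwd in lightllm).
    # q: [batch, heads, 1, head_dim]
    # k_cache, v_cache: [num_slots, heads, head_dim], one row per cached token
    # slot_ids: [batch, past_len] (same past_len per request, for simplicity)
    k = k_cache[slot_ids].permute(0, 2, 1, 3)  # [batch, heads, past_len, head_dim]
    v = v_cache[slot_ids].permute(0, 2, 1, 3)
    return F.scaled_dot_product_attention(q, k, v)
```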

In principle, token attention does not conflict with flash attention. Do you plan to combine the two in the decoding stage?

Also, what is the obstacle to using the flash-attention repo directly with token-level memory management?
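
To make the last question concrete, one workaround I can imagine (a hypothetical sketch, not anything from lightllm or the flash-attention repo) is to gather each request's per-token KV slots into a contiguous, padded buffer before calling a stock fused kernel; the extra gather/copy every decode step is presumably part of the obstacle. `scaled_dot_product_attention` stands in for the fused kernel here:

```python
import torch
import torch.nn.functional as F

def gather_then_fused_attention(q, k_cache, v_cache, slot_ids, seq_lens):
    # q:        [batch, heads, 1, head_dim] -- the single new query token
    # k_cache:  [num_slots, heads, head_dim], one row per cached token
    # slot_ids: [batch, max_len] padded per-request slot indices
    # seq_lens: [batch] true cache lengths, used to mask the padding
    batch, max_len = slot_ids.shape
    k = k_cache[slot_ids].permute(0, 2, 1, 3)  # [batch, heads, max_len, head_dim]
    v = v_cache[slot_ids].permute(0, 2, 1, 3)
    # Boolean mask: True = attend, False = padded slot.
    keep = torch.arange(max_len, device=q.device)[None, :] < seq_lens[:, None]
    attn_mask = keep[:, None, None, :]  # broadcast over heads and the query dim
    # Stand-in for a fused flash-attention call on the contiguous buffer.
    return F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
```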
