
[Feature] Why use flash attention instead of PyTorch's native SDPA? #519


Description

@yangtian6781

Motivation

I'd like to ask why InternVL uses flash attention rather than SDPA. This PR comment suggests that SDPA is actually faster than flash attention: https://github.com/huggingface/transformers/pull/31940#issuecomment-2228246233. After replacing InternVL's flash attention with SDPA, I found that the model's final outputs differ: the model gives different answers. However, the cosine similarity between the SDPA logits and the flash attention logits is above 0.99. Would switching to SDPA be a big issue for reproducing the model?
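For reference, here is a minimal sketch of the kind of comparison described above: loading the same checkpoint under both attention backends and measuring the cosine similarity of the last-token logits. This assumes a transformers checkpoint that accepts the `attn_implementation` argument; the model id and prompt are placeholders, not the exact setup I used.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint id; substitute the actual InternVL weights.
model_id = "your/model-id"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
inputs = tokenizer("Describe the image in detail.", return_tensors="pt").to("cuda")

last_token_logits = {}
for impl in ("flash_attention_2", "sdpa"):
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        attn_implementation=impl,
        trust_remote_code=True,
    ).to("cuda").eval()
    with torch.no_grad():
        # Keep only the final position's logits, upcast to fp32 for comparison.
        last_token_logits[impl] = model(**inputs).logits[:, -1, :].float()
    del model
    torch.cuda.empty_cache()

sim = torch.nn.functional.cosine_similarity(
    last_token_logits["flash_attention_2"], last_token_logits["sdpa"], dim=-1
)
print(f"last-token logits cosine similarity: {sim.item():.6f}")
```

Note that even a cosine similarity above 0.99 does not guarantee identical generations: a tiny numerical difference can flip the argmax token at some step, after which the two decoding trajectories diverge entirely. Differing answers therefore do not by themselves indicate a bug in either backend.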

Related resources

No response

Additional context

No response
