[Feature] Why use flash attention instead of PyTorch's native SDPA?

### Motivation

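Why does InternVL use flash attention rather than SDPA? This PR says that SDPA is faster than flash attention: https://github.com/huggingface/transformers/pull/31940#issuecomment-2228246233

After I replaced InternVL's flash attention with SDPA, the model's final output changed and it gives different answers, even though the cosine similarity between the SDPA logits and the FA logits is above 0.99. Would switching to SDPA be a major issue for reproducing the model's results?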
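
For context on why these two observations can coexist, here is a minimal sketch using made-up logits over a 5-token vocabulary. The values and the helper names `cosine_similarity` and `argmax` are hypothetical and not taken from InternVL. It only shows that a cosine similarity above 0.99 does not guarantee the same top-1 token when two candidates have nearly equal logits.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def argmax(values):
    """Index of the largest value (the token greedy decoding would pick)."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical logits for one decoding step; tokens 0 and 1 are nearly tied.
logits_fa = [10.00, 9.98, 2.0, -1.0, 0.5]    # assumed flash attention output
logits_sdpa = [9.97, 10.01, 2.0, -1.0, 0.5]  # assumed SDPA output, tiny drift

sim = cosine_similarity(logits_fa, logits_sdpa)
same_token = argmax(logits_fa) == argmax(logits_sdpa)

print(f"cosine similarity: {sim:.6f}")
print(f"same top-1 token: {same_token}")
```

With greedy decoding, one flipped token changes every later step, so the full answers can diverge even when the per-step logits are nearly identical. Comparing the top-1 token per step (or the max absolute logit difference) may be a more sensitive check than cosine similarity alone.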


### Related resources

_No response_

### Additional context

_No response_