Describe the bug
When `fused_attn` is used, the attention scale is not passed to `torch.nn.functional.scaled_dot_product_attention`, which then defaults to `q.size(-1) ** -0.5` (i.e. `head_dim ** -0.5`). This differs from the scale defined in the `Attention2d` layer (`num_heads ** -0.5`), so the fused and vanilla implementations produce different results.
See `pytorch-image-models/timm/layers/attention2d.py`, lines 294 to 351 at commit `dafe866`.
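A minimal sketch of the mismatch (the shapes below are chosen for illustration only and do not reproduce the actual `Attention2d` tensor layout):

```python
import torch
import torch.nn.functional as F

num_heads, head_dim = 4, 16
q = torch.randn(1, num_heads, 8, head_dim)
k = torch.randn(1, num_heads, 8, head_dim)
v = torch.randn(1, num_heads, 8, head_dim)

# Fused path: SDPA's scale defaults to q.size(-1) ** -0.5 (head_dim ** -0.5)
fused = F.scaled_dot_product_attention(q, k, v)

# Vanilla path, using the Attention2d scale of num_heads ** -0.5
scale = num_heads ** -0.5
attn = (q @ k.transpose(-2, -1)) * scale
vanilla = attn.softmax(dim=-1) @ v

print(torch.allclose(fused, vanilla))  # False whenever head_dim != num_heads
```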
Expected behavior
Same results for the two implementations.
Desktop (please complete the following information):
- OS: macOS
- This repository version: 1.0.12
- PyTorch version: 2.5 (CPU)
Additional context
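One possible fix would be to pass the layer's own scale to SDPA explicitly. A sketch of a hypothetical patch inside `Attention2d.forward` (variable names assumed from the report, not the exact source):

```python
# Make the fused path use the same scale (num_heads ** -0.5) as the vanilla path
x = torch.nn.functional.scaled_dot_product_attention(
    q, k, v,
    scale=self.scale,
)
```

Alternatively, the vanilla path could be changed to use `head_dim ** -0.5`; either way, both branches should agree on the scale.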