Have you had a chance to try this?

- @custom_fwd(device_type='cuda', cast_inputs=torch.float32)
+ @custom_fwd(device_type='cuda' if torch.cuda.is_available() else 'cpu', cast_inputs=torch.float32)
from torch_geometric.nn.attention import PerformerAttention
My model uses this attention mechanism, and when I train with mixed precision via torch.amp, the loss becomes NaN. After some investigation I pinpointed the issue to PerformerAttention: after only a few dozen batches, the attention computation itself produces NaN.
After consulting the torch.amp documentation, I adopted a simple solution in torch_geometric/nn/attention/performer.py:
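In essence, the change is to decorate the module's forward method with torch.amp's custom_fwd. A rough sketch of the edit (the surrounding code in performer.py is abridged here, and the exact forward signature may differ from the real file):

```python
# torch_geometric/nn/attention/performer.py (abridged sketch)
import torch
from torch.amp import custom_fwd  # recent PyTorch; older releases expose torch.cuda.amp.custom_fwd


class PerformerAttention(torch.nn.Module):
    ...

    # Under torch.amp.autocast, cast incoming floating-point tensors to
    # float32 and disable autocast for the duration of this call, so the
    # attention math always runs in full precision.
    @custom_fwd(device_type='cuda', cast_inputs=torch.float32)
    def forward(self, x, mask=None):
        ...
```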
This prevents torch.amp.autocast from running PerformerAttention in half precision. Although the attention mechanism itself no longer gets the mixed-precision speedup, torch.amp still accelerates the rest of the model, so overall training time is reduced.
One drawback is that the device type has to be specified manually. Does anyone know how to determine it automatically?
I tried using device_type=next(self.parameters()).device, but decorator arguments are evaluated while the class body is being defined, before any instance exists, so self is not in scope and it fails with 'NameError: name 'self' is not defined'.
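A minimal illustration of the failure (Demo is just a made-up module name; note also that device_type expects a string such as 'cuda', so even with access to a parameter it would need .device.type rather than .device):

```python
import torch
from torch.amp import custom_fwd


class Demo(torch.nn.Module):
    # The decorator below runs while the class body is being executed,
    # i.e. before any Demo instance exists, so referencing `self` raises
    # NameError: name 'self' is not defined:
    #
    # @custom_fwd(device_type=next(self.parameters()).device, cast_inputs=torch.float32)

    # A literal device-type string, as in my current solution, works:
    @custom_fwd(device_type='cuda', cast_inputs=torch.float32)
    def forward(self, x):
        return x
```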
My environment: