Request: Optimized attention backend (fwd/bwd) support for GPT-OSS with learnable sink #831

@CPFLAME

Description

Hi team,
I’d like to ask whether there is forward- and backward-pass support for GPT-OSS attention when using a learnable sink.

In GPT-OSS, the attention module includes a learnable sink parameter. As a result, we currently have to train in eager attention mode, which materializes the full attention-score matrix, so memory usage scales quadratically with sequence length and prevents training on long contexts (see the sketch below). I found that some kernels provide forward support for a learnable sink, but I couldn’t find full backward support, so we still can’t use an optimized backend end-to-end for training.
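
For context, here is a minimal sketch of how I understand the learnable sink to enter the eager path: the sink acts as one extra per-head softmax logit, so the full score matrix still has to be materialized. Names and shapes here are my own illustration, not the actual GPT-OSS modeling code, and causal masking is omitted for brevity:

```python
import torch
import torch.nn.functional as F

def eager_sink_attention(q, k, v, sinks, scale):
    """Illustrative eager attention with a per-head learnable sink.

    Assumed shapes: q, k, v are (batch, heads, seq, head_dim);
    sinks is a learnable (heads,) tensor of extra softmax logits.
    Causal masking is omitted for brevity.
    """
    # Full (seq, seq) score matrix per head: this is the O(seq_len^2)
    # memory cost that blocks long-context training in eager mode.
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale    # (b, h, s, s)
    b, h, s, _ = scores.shape
    # The sink contributes one extra logit per query position.
    sink_col = sinks.view(1, h, 1, 1).expand(b, h, s, 1)     # (b, h, s, 1)
    logits = torch.cat([scores, sink_col], dim=-1)           # (b, h, s, s+1)
    probs = F.softmax(logits, dim=-1)
    # The sink absorbs probability mass but has no value vector,
    # so its column is dropped before mixing values.
    return torch.matmul(probs[..., :-1], v)                  # (b, h, s, d)
```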

Questions:

  • Is there an existing or planned roadmap to support learnable-sink attention in both the forward and backward passes for optimized backends on recent GPUs?
  • If yes, could you share an ETA or pointers to the relevant branch/PR?
  • A minimal example showing how to enable this with GPT-OSS (model/config flags, attention impl selection) would be very helpful; for reference, how we currently load the model is sketched right after this list.
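
This is roughly how we load the model today (a sketch using the standard transformers `attn_implementation` selector; "eager" is the only mode we have verified end-to-end for training with the sink):

```python
from transformers import AutoModelForCausalLM

# What we do today: force the eager path so the learnable sink is
# handled correctly in both forward and backward, at the cost of
# O(seq_len^2) attention memory. Model ID shown for the 20B variant.
model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    attn_implementation="eager",
    torch_dtype="auto",
)
```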

Environment (for reference):

Model: GPT-OSS (20B, 120B)
GPU: H20

Thanks a lot!

Metadata

Labels: question (Further information is requested)
