Request: Optimized attention backend (fwd/bwd) support for GPT-OSS with learnable sink #831

@CPFLAME

Description

Hi team,
I’d like to ask whether there is forward- and backward-pass support for GPT-OSS attention when using a learnable sink.

In GPT-OSS, the attention module includes a learnable sink parameter. As a result, we currently have to train in eager attention mode, which materializes the full attention-score matrix, so memory usage scales quadratically with sequence length and prevents training on long contexts (see the sketch below). I found that some kernels provide forward support for a learnable sink, but I couldn’t find full backward support, so we still can’t use an optimized backend end-to-end for training.
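
For context, here is a minimal sketch of how I understand the learnable sink to enter the eager path: the sink acts as one extra per-head softmax logit, so the full score matrix still has to be materialized. Names and shapes here are my own illustration, not the actual GPT-OSS modeling code, and causal masking is omitted for brevity:

```python
import torch
import torch.nn.functional as F

def eager_sink_attention(q, k, v, sinks, scale):
    """Illustrative eager attention with a per-head learnable sink.

    Assumed shapes: q, k, v are (batch, heads, seq, head_dim);
    sinks is a learnable (heads,) tensor of extra softmax logits.
    Causal masking is omitted for brevity.
    """
    # Full (seq, seq) score matrix per head: this is the O(seq_len^2)
    # memory cost that blocks long-context training in eager mode.
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale    # (b, h, s, s)
    b, h, s, _ = scores.shape
    # The sink contributes one extra logit per query position.
    sink_col = sinks.view(1, h, 1, 1).expand(b, h, s, 1)     # (b, h, s, 1)
    logits = torch.cat([scores, sink_col], dim=-1)           # (b, h, s, s+1)
    probs = F.softmax(logits, dim=-1)
    # The sink absorbs probability mass but has no value vector,
    # so its column is dropped before mixing values.
    return torch.matmul(probs[..., :-1], v)                  # (b, h, s, d)
```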

Questions:

  • Is there an existing or planned roadmap to support learnable-sink attention in both the forward and backward passes for optimized backends on recent GPUs?
  • If yes, could you share an ETA or pointers to the relevant branch/PR?
  • A minimal example showing how to enable this with GPT-OSS (model/config flags, attention impl selection) would be very helpful; for reference, how we currently load the model is sketched right after this list.
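
This is roughly how we load the model today (a sketch using the standard transformers `attn_implementation` selector; "eager" is the only mode we have verified end-to-end for training with the sink):

```python
from transformers import AutoModelForCausalLM

# What we do today: force the eager path so the learnable sink is
# handled correctly in both forward and backward, at the cost of
# O(seq_len^2) attention memory. Model ID shown for the 20B variant.
model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    attn_implementation="eager",
    torch_dtype="auto",
)
```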

Environment (for reference):

Model: GPT-OSS (20B, 120B)
GPU: H20

Thanks a lot!

Metadata

Labels: question (Further information is requested)
