Description
Hi team,
I’d like to ask whether there is forward and backward support for GPT-OSS attention when using a learnable sink.
In GPT-OSS, the attention module includes a learnable sink parameter. As a result, we currently have to train in eager attention mode, which makes memory usage scale quadratically with sequence length and prevents training on long contexts. I found that some kernels provide forward support for a learnable sink, but I couldn't find full backward support, so we still can't use an optimized backend end-to-end for training.
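For clarity, here is a minimal eager-mode sketch of the sink mechanism as we understand it (the function and variable names are ours, not the GPT-OSS source): the per-head sink logit joins the softmax normalization, so it absorbs attention mass, and gradients must also flow through it in the backward pass.

```python
import torch
import torch.nn.functional as F

def sink_attention(q, k, v, sinks, attn_mask=None):
    # q, k, v: [batch, heads, seq, head_dim]; sinks: [heads], learnable logits.
    scale = q.shape[-1] ** -0.5
    logits = torch.matmul(q, k.transpose(-2, -1)) * scale  # [b, h, q_len, k_len]
    if attn_mask is not None:
        logits = logits + attn_mask                        # additive causal mask
    b, h, q_len, _ = logits.shape
    sink_col = sinks.view(1, h, 1, 1).expand(b, h, q_len, 1)
    # The sink logit participates in the softmax normalization, so the weights
    # over real keys sum to less than 1; backward must flow through it too.
    probs = F.softmax(torch.cat([logits, sink_col], dim=-1), dim=-1)
    probs = probs[..., :-1]                                # drop the sink column
    return torch.matmul(probs, v)
```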
Questions:
- Is there an existing or planned roadmap to support learnable-sink attention in both forward and backward for optimized backends on recent GPUs?
- If yes, could you share an ETA or pointers to the relevant branch/PR?
- A minimal example showing how to enable this with GPT-OSS (model/config flags, attention impl selection) would be very helpful; for reference, the sketch after this list shows how we currently load the model in eager mode.
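This is roughly how we currently have to load the model for training (assuming the Hugging Face transformers integration; the model id and kwargs are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    attn_implementation="eager",  # optimized backends lack sink backward support
    torch_dtype=torch.bfloat16,
)
```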
Environment (for reference):
- Model: GPT-OSS (20B, 120B)
- GPU: H20
Thanks a lot!