[Feature]: Support for Triton attention backend for inference #1544

### 🚀 The feature, motivation and pitch

Currently, PagedAttention only supports a fixed set of head_size values, which prevents models such as Magistral 2509 (head_size of 160) from running. vLLM handles this case by using Triton as the attention backend instead of PagedAttention.

I recommend providing Triton as an alternative in situations where PagedAttention is not suitable for running a model.
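
To make the request concrete, here is a minimal sketch of the fallback in Python. All names (`AttentionBackend`, `select_attention_backend`, `PAGED_ATTENTION_HEAD_SIZES`) are hypothetical and do not come from any existing codebase, and the set of supported head sizes is a placeholder for illustration; the real set depends on which PagedAttention kernels are available.

```python
from enum import Enum

# Placeholder values for illustration only (an assumption, not the real list).
# The actual supported set depends on the PagedAttention kernels in use.
PAGED_ATTENTION_HEAD_SIZES = frozenset({64, 80, 96, 112, 128, 256})


class AttentionBackend(Enum):
    """Hypothetical identifiers for the two attention backends."""
    PAGED_ATTENTION = "paged_attention"
    TRITON = "triton"


def select_attention_backend(head_size, supported=PAGED_ATTENTION_HEAD_SIZES):
    """Pick PagedAttention when it supports head_size, otherwise Triton."""
    if head_size <= 0:
        raise ValueError(f"head_size must be positive, got {head_size}")
    if head_size in supported:
        return AttentionBackend.PAGED_ATTENTION
    # head_size is not covered by PagedAttention, so fall back to Triton.
    return AttentionBackend.TRITON


# Example: a head_size of 160 is outside the placeholder set above.
backend = select_attention_backend(160)
print(backend.value)
```

With this shape, models that PagedAttention already handles keep their current path, and only the unsupported head sizes are routed to Triton.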

### Alternatives

Continue not supporting models whose head_size values are unsupported by PagedAttention.

### Additional context

_No response_
