[Feature]: Support for Triton attention backend for inference #1544

Description

@stepfunction83

🚀 The feature, motivation and pitch

Currently, PagedAttention supports only a fixed set of head_size values, so any model whose head_size falls outside that set cannot run. Magistral 2509, with a head_size of 160, is one example. vLLM handles these cases by using Triton as the attention backend instead of PagedAttention.

I recommend offering a Triton attention backend as a fallback for models that PagedAttention cannot run.
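
To make the request concrete, here is a rough sketch of the kind of fallback I have in mind. It is illustrative only: the names `AttentionBackend`, `select_attention_backend`, and `PAGED_ATTENTION_HEAD_SIZES` are hypothetical, and the set of supported head sizes is a placeholder assumption rather than the actual list compiled into this project's kernels.

```python
from enum import Enum

# Placeholder assumption: the real set depends on which PagedAttention
# kernels the project actually ships.
PAGED_ATTENTION_HEAD_SIZES = frozenset({64, 80, 96, 112, 128, 256})


class AttentionBackend(Enum):
    """Hypothetical backend identifiers, used only for this sketch."""
    PAGED_ATTENTION = "paged_attention"
    TRITON = "triton"


def select_attention_backend(head_size, supported=PAGED_ATTENTION_HEAD_SIZES):
    """Prefer PagedAttention; fall back to Triton for unsupported head sizes."""
    if head_size <= 0:
        raise ValueError(f"head_size must be positive, got {head_size}")
    if head_size in supported:
        return AttentionBackend.PAGED_ATTENTION
    # The head size has no matching PagedAttention kernel, so use the
    # Triton path instead of refusing to load the model.
    return AttentionBackend.TRITON
```

Under these assumptions, a model with a head_size of 160 would be routed to the Triton backend, while one with a head_size of 128 would keep using PagedAttention.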

Alternatives

The only alternative is to leave unsupported the range of models whose head_size values PagedAttention cannot handle.

Additional context

No response
