
Conversation

barronalex (Contributor) commented:

First pass at adapting @angeloskath's flash attention to support quantized keys and values.

This still needs some optimization work, since it is currently faster to write out the quantized_matmuls explicitly than to use this fused version.

E.g., 4-bit on an M2 Ultra with L=32768:

Timing sdpa ... 2.51938 msec
Timing quant_sdpa ... 0.97137 msec
Timing attention ... 1.31419 msec
Timing quant_attention ... 0.92342 msec
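For reference, here is a minimal sketch (not the kernel added in this PR) of two of the variants above, using MLX's Python API: quant_attention built from explicit mx.quantized_matmul calls on 4-bit K/V, and the regular fused sdpa via mx.fast.scaled_dot_product_attention on float K/V. The shapes, group size, and timing helper are illustrative assumptions, and it assumes mx.quantize and mx.quantized_matmul accept batched (per-head) arrays.

```python
import time
import mlx.core as mx

B, H, L, D = 1, 8, 32768, 128          # illustrative shapes (single-query decode step)
group_size, bits = 64, 4

q = mx.random.normal((B, H, 1, D))
k = mx.random.normal((B, H, L, D))
v = mx.random.normal((B, H, L, D))

# Quantize K and V along the last axis (4-bit, groups of 64 elements).
# Assumes batched inputs are supported here.
k_q, k_s, k_b = mx.quantize(k, group_size=group_size, bits=bits)
v_q, v_s, v_b = mx.quantize(v, group_size=group_size, bits=bits)

def sdpa(q, k, v):
    # Regular fused attention on float K/V.
    return mx.fast.scaled_dot_product_attention(q, k, v, scale=D ** -0.5)

def quant_attention(q, k_q, k_s, k_b, v_q, v_s, v_b):
    # scores = (q / sqrt(D)) @ K^T, computed directly from the quantized K.
    scores = mx.quantized_matmul(
        q * (D ** -0.5), k_q, k_s, k_b,
        transpose=True, group_size=group_size, bits=bits,
    )
    probs = mx.softmax(scores, axis=-1)
    # out = probs @ V, again without materializing a float V.
    return mx.quantized_matmul(
        probs, v_q, v_s, v_b,
        transpose=False, group_size=group_size, bits=bits,
    )

def timeit(fn, *args, reps=20):
    mx.eval(fn(*args))                  # warm-up
    tic = time.perf_counter()
    for _ in range(reps):
        mx.eval(fn(*args))
    return 1e3 * (time.perf_counter() - tic) / reps

print(f"Timing sdpa ... {timeit(sdpa, q, k, v):.5f} msec")
print(f"Timing quant_attention ... {timeit(quant_attention, q, k_q, k_s, k_b, v_q, v_s, v_b):.5f} msec")
```

The fused quant_sdpa path in this PR would replace the two quantized_matmuls and the softmax with a single kernel call; the numbers above show that the unfused quant_attention is currently the faster of the two quantized paths.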

@awni mentioned this pull request Apr 28, 2025

bghira commented Sep 18, 2025

JFYI, I have a working int8 and int4 quantised attention implementation, MIT licensed.
