Unroll the tail loop of SME kernels#9609

Merged: copybara-service[bot] merged 1 commit into master from test_876449775 on Mar 1, 2026
Conversation


copybara-service[bot] commented Feb 27, 2026

Unroll the tail loop of SME kernels

This speeds up cases where N is not a multiple of svl * 4.

This is part of #9531 by @kasper0406. I tried explicitly skipping the ops if the mask was empty, as #9531 did, but I found no difference in performance, and it's simpler to just let the mask handle it.

After this change, the main loop and the tail case are almost identical; it would be nice to find a way to deduplicate them.

Before:

```
-------------------------------------------------------------------------------------------------
Benchmark                                       Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------
dot/dot_fp32_sme2/real_time                 22370 ns        22369 ns         6263 OP=1.23594T/s 240x240x240
dot/dot_bf16_bf16_fp32_sme2/real_time       21820 ns        21820 ns         6406 OP=1.2671T/s 240x240x240
dot/dot_fp16_fp16_fp32_sme2/real_time       21806 ns        21806 ns         6428 OP=1.26791T/s 240x240x240
dot/dot_int8_int8_int32_sme2/real_time       8207 ns         8206 ns        17055 OP=3.36887T/s 240x240x240
dot/dot_fp32_sme/real_time                  22043 ns        22041 ns         6342 OP=1.25427T/s 240x240x240
dot/dot_bf16_bf16_fp32_sme/real_time        21541 ns        21540 ns         6476 OP=1.28351T/s 240x240x240
dot/dot_fp16_fp16_fp32_sme/real_time        21556 ns        21555 ns         6505 OP=1.28264T/s 240x240x240
dot/dot_int8_int8_int32_sme/real_time        8206 ns         8206 ns        17058 OP=3.36911T/s 240x240x240
```

After:

```
-------------------------------------------------------------------------------------------------
Benchmark                                       Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------
dot/dot_fp32_sme2/real_time                 16532 ns        16532 ns         8468 OP=1.6724T/s 240x240x240
dot/dot_bf16_bf16_fp32_sme2/real_time       15666 ns        15661 ns         8897 OP=1.76481T/s 240x240x240
dot/dot_fp16_fp16_fp32_sme2/real_time       15562 ns        15562 ns         8991 OP=1.7766T/s 240x240x240
dot/dot_int8_int8_int32_sme2/real_time       7796 ns         7795 ns        17819 OP=3.54665T/s 240x240x240
dot/dot_fp32_sme/real_time                  16349 ns        16349 ns         8553 OP=1.6911T/s 240x240x240
dot/dot_bf16_bf16_fp32_sme/real_time        15596 ns        15594 ns         8978 OP=1.77273T/s 240x240x240
dot/dot_fp16_fp16_fp32_sme/real_time        15581 ns        15578 ns         8996 OP=1.77443T/s 240x240x240
dot/dot_int8_int8_int32_sme/real_time        7779 ns         7779 ns        17917 OP=3.55437T/s 240x240x240
```

It's roughly a 1.4x speedup for the floating-point cases; the int8 cases improve by only about 5%.

A similar opportunity probably exists for Intel AMX too, though it might be trickier to implement.


PiperOrigin-RevId: 877092621
copybara-service[bot] merged commit 406cc2d into master on Mar 1, 2026
copybara-service[bot] deleted the test_876449775 branch on March 1, 2026 at 22:38
