
Optimize gemv_n_sve_v1x3 kernel #5292


Open: isharif168 wants to merge 1 commit into develop from isharif168:optimized_gemv_n_1x3

isharif168 commented Jun 9, 2025

  • Calculate predicate outside the loop
  • Divide matrix in blocks of 3
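The two changes can be pictured with a portable scalar sketch. This is not the kernel from the PR: `gemv_n_blocked`, `VL`, and the loop layout are hypothetical stand-ins, with `VL` playing the role of the SVE vector length in elements, and it assumes "blocks of 3" means three columns of a column-major matrix per pass.

```c
#include <stddef.h>

/* Hypothetical stand-in for the SVE vector length in elements. */
enum { VL = 4 };

/* Scalar model of y += alpha * A * x for a column-major m x n matrix A
 * with leading dimension lda. Columns are consumed three at a time, and
 * the tail size (the analogue of the tail predicate) is computed once,
 * before any loop. */
static void gemv_n_blocked(size_t m, size_t n, double alpha,
                           const double *a, size_t lda,
                           const double *x, double *y)
{
    const size_t tail = m % VL;      /* active lanes of the last partial vector */
    const size_t m_full = m - tail;  /* rows covered by whole vectors */
    size_t j = 0;

    /* Blocks of 3 columns: each y element is loaded and stored once per block. */
    for (; j + 3 <= n; j += 3) {
        const double *a0 = a + (j + 0) * lda;
        const double *a1 = a + (j + 1) * lda;
        const double *a2 = a + (j + 2) * lda;
        const double x0 = alpha * x[j + 0];
        const double x1 = alpha * x[j + 1];
        const double x2 = alpha * x[j + 2];

        /* Whole vectors: the inner loop models one full-width vector operation. */
        for (size_t i = 0; i < m_full; i += VL)
            for (size_t l = 0; l < VL; l++)
                y[i + l] += a0[i + l] * x0 + a1[i + l] * x1 + a2[i + l] * x2;

        /* Tail: only the first `tail` lanes are active. */
        for (size_t l = 0; l < tail; l++)
            y[m_full + l] += a0[m_full + l] * x0 + a1[m_full + l] * x1
                           + a2[m_full + l] * x2;
    }

    /* Leftover columns when n is not a multiple of 3. */
    for (; j < n; j++) {
        const double *a0 = a + j * lda;
        const double x0 = alpha * x[j];
        for (size_t i = 0; i < m; i++)
            y[i] += a0[i] * x0;
    }
}
```

Grouping three columns means each element of `y` is read and written once per block rather than once per column, which is a plausible source of the speedup, though the PR itself does not spell out the reason.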

Comparison

x-axis -> M = N
y-axis -> GFLOPS (timing)
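For reference, a GFLOPS figure for GEMV is commonly derived from the elapsed time using a count of roughly 2·M·N floating-point operations (one multiply and one add per matrix element). The helper below is a hypothetical sketch of that conversion; the PR does not show how its numbers were computed.

```c
#include <stddef.h>

/* Hypothetical helper: converts a measured GEMV time into GFLOPS,
 * assuming about 2 * m * n floating-point operations per call. */
static double gemv_gflops(size_t m, size_t n, double seconds)
{
    return 2.0 * (double)m * (double)n / seconds / 1e9;
}
```

Under this assumption, a 1000 x 1000 GEMV that takes one millisecond corresponds to about 2 GFLOPS.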

pg00 = svand_z(SV_TRUE(), pg0, pg00);
pg01 = svand_z(SV_TRUE(), pg0, pg01);
pg02 = svand_z(SV_TRUE(), pg0, pg02);
svbool_t pg_tail = SV_WHILE(i, m);
Contributor

Would it be better to pre-calculate this predicate outside the loop?
It is reused again below.

Author

Since we are calculating the predicate for the tail elements, I think it depends on the value of i. If we move it outside the loop, we have to calculate it for (0, m % sve_size), which could sometimes go wrong, since we want the predicate for (i, m) and not one starting from 0. What are your thoughts on this?

Contributor

I don't see why (0, m % sve_size) wouldn't work, since we increment i by sve_size in the main loop. Please also see https://github.com/OpenMathLib/OpenBLAS/pull/5089/files#diff-d0b63f332b08eef9b57a1eec785ff43afc468108c60f237b0c4e9401df08b510R68

Author

Yes, it will work. It just gives the predicate for the index range (0, m % sve_size) rather than the actual range (i, m). Sure, I will make this change. Thanks.
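The equivalence agreed on in this thread can be checked with a scalar model. The sketch assumes that `SV_WHILE` wraps an SVE "while less than" intrinsic, so that lane l is active when start + l < end; `while_lt`, `tail_predicates_match`, and `VL` are hypothetical names used only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the SVE vector length in elements. */
enum { VL = 4 };

/* Scalar model of a "while less than" predicate:
 * lane l is active when start + l < end. */
static void while_lt(size_t start, size_t end, bool pg[VL])
{
    for (size_t l = 0; l < VL; l++)
        pg[l] = (start + l < end);
}

/* Returns true when the predicate hoisted out of the loop, built from
 * (0, m % VL), has the same active lanes as the one built from (i, m)
 * after the main loop has consumed all whole vectors. */
static bool tail_predicates_match(size_t m)
{
    bool hoisted[VL];
    bool in_loop[VL];

    while_lt(0, m % VL, hoisted);   /* computed once, before the loop */

    size_t i = 0;
    while (i + VL <= m)             /* main loop: i advances by VL each step */
        i += VL;
    while_lt(i, m, in_loop);        /* original form, computed at the tail */

    for (size_t l = 0; l < VL; l++)
        if (hoisted[l] != in_loop[l])
            return false;
    return true;
}
```

Because i is a multiple of `VL` when the main loop exits, m - i equals m % `VL`, so both forms activate exactly the first m % `VL` lanes. The predicate only selects lanes relative to whatever base address it is used with, so the differing start index does not matter.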


codspeed-hq bot commented Jun 11, 2025

CodSpeed Performance Report

Merging #5292 will improve performance by 10.54%

Comparing isharif168:optimized_gemv_n_1x3 (1ed7eb6) with develop (02267d8)

Summary

⚡ 1 improvement
✅ 61 untouched benchmarks

Benchmarks breakdown

| Benchmark | BASE | HEAD | Change |
|---|---|---|---|
| test_dgemv[1000-s] | 7.7 ms | 7 ms | +10.54% |

isharif168 force-pushed the optimized_gemv_n_1x3 branch from 1ed7eb6 to 8279e68 on June 11, 2025 at 10:25