
CUDA: fewer memory bank conflicts for mul_mat_q #2458

Merged

Conversation

JohannesGaessler (Collaborator) commented Jul 30, 2023

This PR replaces the simple code for loading data into shared memory tiles for the mul_mat_q kernels with more complicated code that reduces memory bank conflicts as much as possible. The exceptions are q4_0 and q8_0, where performance was slightly worse (I'm keeping the corresponding code commented out because it will probably be useful when employing asynchronous data loading). These are the performance numbers on my system:

| GPU | Model | Test | t/s master | t/s PR | Speedup |
|---|---|---|---|---|---|
| RTX 3090 | 7b q4_0 | pp | 1329 | 1372 | 1.03 |
| RTX 3090 | 7b q4_1 | pp | 1275 | 1261 | 0.99 |
| RTX 3090 | 7b q5_0 | pp | 973 | 978 | 1.01 |
| RTX 3090 | 7b q5_1 | pp | 897 | 956 | 1.07 |
| RTX 3090 | 7b q8_0 | pp | 1068 | 1092 | 1.02 |
| RTX 3090 | 7b q2_k | pp | 642 | 680 | 1.06 |
| RTX 3090 | 7b q3_k_s | pp | 473 | 537 | 1.14 |
| RTX 3090 | 7b q4_k_s | pp | 835 | 870 | 1.04 |
| RTX 3090 | 7b q5_k_s | pp | 516 | 534 | 1.03 |
| RTX 3090 | 7b q6_k | pp | 586 | 649 | 1.11 |
| P40 | 7b q4_0 | pp | 624 | 656 | 1.05 |
| P40 | 7b q4_1 | pp | 360 | 436 | 1.21 |
| P40 | 7b q5_0 | pp | 246 | 386 | 1.57 |
| P40 | 7b q5_1 | pp | 254 | 365 | 1.44 |
| P40 | 7b q8_0 | pp | 529 | 541 | 1.02 |
| P40 | 7b q2_k | pp | 209 | 236 | 1.13 |
| P40 | 7b q3_k_s | pp | 120 | 201 | 1.68 |
| P40 | 7b q4_k_s | pp | 228 | 238 | 1.04 |
| P40 | 7b q5_k_s | pp | 200 | 206 | 1.03 |
| P40 | 7b q6_k | pp | 161 | 236 | 1.47 |

For reference: cuBLAS performance is ~1300 t/s on an RTX 3090 and ~460 t/s on a P40.

slaren (Collaborator) commented Jul 31, 2023

Do the `__builtin_assume` calls make a difference in performance?

JohannesGaessler (Collaborator, Author)

They definitely do for the more complicated kernels, where they allow the compiler to optimize out some instructions that can have no effect given the tile sizes. I added them to the simpler kernels too because that was less work than manually checking performance for each kernel.

@JohannesGaessler JohannesGaessler merged commit 2dbf518 into ggerganov:master Jul 31, 2023
25 checks passed
whoreson (Contributor)

`error: identifier "__builtin_assume" is undefined`

This is with a somewhat older CUDA version that has worked fine so far; please review.

JohannesGaessler (Collaborator, Author)

If you want help, provide more detailed debugging information.

whoreson (Contributor)

Yeah, but I don't know the first thing about CUDA/cuBLAS.

It's this package: libcublas10:amd64 10.1.243-3

I guess just install it, and you'll probably see why it doesn't compile.

cebtenzzre (Collaborator)

`__builtin_assume` is only available as of CUDA 11.2. I suppose you could write a macro around `__builtin_unreachable`.

whoreson (Contributor) commented Oct 9, 2023

Due to recent changes, another regression occurred:

`ggml-cuda.cu(6299): error: identifier "CUBLAS_COMPUTE_16F" is undefined`
