
CUDA: Implemented row flattening for non-glm RoPE #2468

Merged

Conversation

JohannesGaessler (Collaborator) commented on Jul 31, 2023

This PR adds support for flattening tensor rows in the non-glm CUDA RoPE implementation, reducing the associated kernel launch overhead by a factor of 512 for prompt processing. There is no difference for token generation:

| GPU | Model | Test | t/s master | t/s PR | Speedup |
| --- | --- | --- | --- | --- | --- |
| RTX 3090 | 7b q4_0 | pp | 1347 | 1550 | 1.15 |
| RTX 3090 | 13b q4_0 | pp | 773 | 851 | 1.10 |
| RTX 3090 | 33b q4_0 | pp | 325 | 344 | 1.06 |
| RTX 3090 | 7b q4_0 | tg128 | 131.66 | 130.78 | 0.99 |
| RTX 3090 | 13b q4_0 | tg128 | 73.50 | 73.14 | 1.00 |
| RTX 3090 | 33b q4_0 | tg128 | 33.25 | 33.10 | 1.00 |
| P40 | 7b q4_0 | pp | 624 | 665 | 1.07 |
| P40 | 13b q4_0 | pp | 348 | 364 | 1.05 |
| P40 | 33b q4_0 | pp | 148 | 150 | 1.01 |
| P40 | 7b q4_0 | tg128 | 50.54 | 50.62 | 1.00 |
| P40 | 13b q4_0 | tg128 | 27.89 | 27.84 | 1.00 |
| P40 | 33b q4_0 | tg128 | 12.08 | 12.07 | 1.00 |
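
For context, here is a minimal, self-contained sketch of the flattening idea; it is not the llama.cpp kernel. The launch grid spans every row of the flattened tensor, so RoPE for a whole prompt batch takes one kernel launch instead of one per row. The kernel name `rope_flat_f32`, the host-side setup, and the simplifying assumption that row `i` gets position `p0 + i` are all illustrative, not taken from this PR.

```cuda
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Non-glm style RoPE on a flattened [nrows, ncols] buffer.
// One thread rotates one adjacent (x[2i], x[2i+1]) pair; grid.y spans
// all rows, so the whole tensor is handled by a single launch.
__global__ void rope_flat_f32(float *x, const int ncols,
                              const float p0, const float theta_scale) {
    const int col = 2 * (blockDim.x * blockIdx.x + threadIdx.x);
    if (col >= ncols) return;
    const int row = blockIdx.y;                      // flattened row index

    // Simplifying assumption for this sketch: position = p0 + row.
    const float theta = (p0 + row) * powf(theta_scale, col / 2);
    const float sin_t = sinf(theta);
    const float cos_t = cosf(theta);

    const int i = row * ncols + col;
    const float x0 = x[i];
    const float x1 = x[i + 1];
    x[i]     = x0 * cos_t - x1 * sin_t;              // rotate the pair
    x[i + 1] = x0 * sin_t + x1 * cos_t;
}

int main() {
    const int ncols = 128;                           // head dimension
    const int nrows = 512 * 32;                      // e.g. 512 tokens x 32 heads
    const size_t n  = (size_t) ncols * nrows;

    float *hx = (float *) malloc(n * sizeof(float));
    for (size_t j = 0; j < n; ++j) hx[j] = 1.0f;

    float *dx;
    cudaMalloc((void **) &dx, n * sizeof(float));
    cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);

    // Single launch for all rows; without flattening this would be
    // one launch per row group (512x more for a 512-token prompt batch).
    const dim3 block(ncols / 2, 1, 1);
    const dim3 grid(1, nrows, 1);
    const float theta_scale = powf(10000.0f, -2.0f / ncols);
    rope_flat_f32<<<grid, block>>>(dx, ncols, 0.0f, theta_scale);

    cudaMemcpy(hx, dx, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("x[0] = %f, x[1] = %f\n", hx[0], hx[1]);

    cudaFree(dx);
    free(hx);
    return 0;
}
```

The factor-of-512 figure quoted above presumably corresponds to the default prompt-processing batch size of 512 in llama.cpp, which is why prompt processing benefits while single-token generation is unaffected.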

slaren (Collaborator) commented on Jul 31, 2023

This seems to improve performance under WSL even more: from 0.48 seconds per pass to 0.36 seconds per pass.

JohannesGaessler merged commit 1215ed7 into ggerganov:master on Jul 31, 2023
25 checks passed