
CUDA: Implemented row flattening for non-glm RoPE #2468

Merged

Conversation

JohannesGaessler (Collaborator) commented on Jul 31, 2023

This PR adds support for flattening tensor rows in the non-glm CUDA RoPE implementation, reducing the associated kernel launch overhead by a factor of 512 for prompt processing. There is no difference for token generation:

| GPU | Model | Test | t/s master | t/s PR | Speedup |
| --- | --- | --- | --- | --- | --- |
| RTX 3090 | 7b q4_0 | pp | 1347 | 1550 | 1.15 |
| RTX 3090 | 13b q4_0 | pp | 773 | 851 | 1.10 |
| RTX 3090 | 33b q4_0 | pp | 325 | 344 | 1.06 |
| RTX 3090 | 7b q4_0 | tg128 | 131.66 | 130.78 | 0.99 |
| RTX 3090 | 13b q4_0 | tg128 | 73.50 | 73.14 | 1.00 |
| RTX 3090 | 33b q4_0 | tg128 | 33.25 | 33.10 | 1.00 |
| P40 | 7b q4_0 | pp | 624 | 665 | 1.07 |
| P40 | 13b q4_0 | pp | 348 | 364 | 1.05 |
| P40 | 33b q4_0 | pp | 148 | 150 | 1.01 |
| P40 | 7b q4_0 | tg128 | 50.54 | 50.62 | 1.00 |
| P40 | 13b q4_0 | tg128 | 27.89 | 27.84 | 1.00 |
| P40 | 33b q4_0 | tg128 | 12.08 | 12.07 | 1.00 |
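
For context, here is a minimal, self-contained sketch of the flattening idea; it is not the llama.cpp kernel. The launch grid spans every row of the flattened tensor, so RoPE for a whole prompt batch takes one kernel launch instead of one per row. The kernel name `rope_flat_f32`, the host-side setup, and the simplifying assumption that row `i` gets position `p0 + i` are all illustrative, not taken from this PR.

```cuda
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Non-glm style RoPE on a flattened [nrows, ncols] buffer.
// One thread rotates one adjacent (x[2i], x[2i+1]) pair; grid.y spans
// all rows, so the whole tensor is handled by a single launch.
__global__ void rope_flat_f32(float *x, const int ncols,
                              const float p0, const float theta_scale) {
    const int col = 2 * (blockDim.x * blockIdx.x + threadIdx.x);
    if (col >= ncols) return;
    const int row = blockIdx.y;                      // flattened row index

    // Simplifying assumption for this sketch: position = p0 + row.
    const float theta = (p0 + row) * powf(theta_scale, col / 2);
    const float sin_t = sinf(theta);
    const float cos_t = cosf(theta);

    const int i = row * ncols + col;
    const float x0 = x[i];
    const float x1 = x[i + 1];
    x[i]     = x0 * cos_t - x1 * sin_t;              // rotate the pair
    x[i + 1] = x0 * sin_t + x1 * cos_t;
}

int main() {
    const int ncols = 128;                           // head dimension
    const int nrows = 512 * 32;                      // e.g. 512 tokens x 32 heads
    const size_t n  = (size_t) ncols * nrows;

    float *hx = (float *) malloc(n * sizeof(float));
    for (size_t j = 0; j < n; ++j) hx[j] = 1.0f;

    float *dx;
    cudaMalloc((void **) &dx, n * sizeof(float));
    cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);

    // Single launch for all rows; without flattening this would be
    // one launch per row group (512x more for a 512-token prompt batch).
    const dim3 block(ncols / 2, 1, 1);
    const dim3 grid(1, nrows, 1);
    const float theta_scale = powf(10000.0f, -2.0f / ncols);
    rope_flat_f32<<<grid, block>>>(dx, ncols, 0.0f, theta_scale);

    cudaMemcpy(hx, dx, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("x[0] = %f, x[1] = %f\n", hx[0], hx[1]);

    cudaFree(dx);
    free(hx);
    return 0;
}
```

The factor-of-512 figure quoted above presumably corresponds to the default prompt-processing batch size of 512 in llama.cpp, which is why prompt processing benefits while single-token generation is unaffected.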

slaren (Collaborator) commented on Jul 31, 2023

This seems to improve performance under WSL even more: from 0.48 seconds per pass to 0.36 seconds per pass.

JohannesGaessler merged commit 1215ed7 into ggerganov:master on Jul 31, 2023
25 checks passed