OpenCL dequant_mul_mat #1459
Conversation
clang-tidy made some suggestions
There seems to be a lot of code duplication. Does OpenCL not support templates?
No, it's a C-based language. But you could generate the code as strings as needed and compile it at runtime.
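For readers unfamiliar with OpenCL, "compile at runtime" means the kernel source is just a C string handed to the driver, so it can be assembled programmatically before building. Below is a minimal, hedged sketch of that workflow; the example kernel is made up and error handling is trimmed, so treat it as an illustration rather than the project's actual code.

#include <stdio.h>
#include <CL/cl.h>

/* kernel source held as an ordinary string; it could just as well be generated at runtime */
static const char * program_source =
    "__kernel void scale(__global float * x, const float s) {\n"
    "    const int i = get_global_id(0);\n"
    "    x[i] *= s;\n"
    "}\n";

static cl_kernel build_kernel(cl_context ctx, cl_device_id dev, const char * src, const char * name) {
    cl_int err;
    cl_program program = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    err = clBuildProgram(program, 1, &dev, NULL, NULL, NULL);
    if (err != CL_SUCCESS) {
        /* print the build log so kernel compile errors show up at runtime */
        char log[4096];
        clGetProgramBuildInfo(program, dev, CL_PROGRAM_BUILD_LOG, sizeof(log), log, NULL);
        fprintf(stderr, "OpenCL build failed:\n%s\n", log);
        return NULL;
    }
    return clCreateKernel(program, name, &err);
}

The same mechanism works whether the string is a compile-time literal or something stitched together from smaller pieces per quantization format.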
That is unfortunate. Just today I did a PR that deduplicated the CUDA code by creating another template for just dequantization: #1453. That way you only have to maintain 1 method per quantization + 2 template methods.
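For readers following along, this is roughly what the template approach of #1453 looks like, as a hedged sketch rather than the exact PR code: each quantization format supplies a small device-side dequantize function, and a single templated kernel covers the rest. The q4_0 block layout shown here (float scale plus 16 packed nibbles) is an assumption for illustration.

#include <cstdint>

typedef struct {
    float   d;       // scale (assumed layout)
    uint8_t qs[16];  // 32 4-bit quants packed into 16 bytes
} block_q4_0;

typedef void (*dequantize_kernel_t)(const void * vx, const int ib, const int iqs, float & v0, float & v1);

// one small function per quantization format
static __device__ void dequantize_q4_0(const void * vx, const int ib, const int iqs, float & v0, float & v1) {
    const block_q4_0 * x = (const block_q4_0 *) vx;
    const float   d   = x[ib].d;
    const uint8_t vui = x[ib].qs[iqs];
    v0 = ((vui & 0xF) - 8) * d;
    v1 = ((vui >>  4) - 8) * d;
}

// one templated kernel shared by all quantization formats
template <int qk, int qr, dequantize_kernel_t dequantize_kernel>
static __global__ void dequantize_block(const void * vx, float * y, const int k) {
    const int i = blockDim.x*blockIdx.x + 2*threadIdx.x;
    if (i >= k) {
        return;
    }
    const int ib   = i/qk;      // block index
    const int iqs  = (i%qk)/qr; // quant index
    const int iybs = i - i%qk;  // y block start index
    const int y_offset = qr == 1 ? 1 : qk/2;

    dequantize_kernel(vx, ib, iqs, y[iybs + iqs + 0], y[iybs + iqs + y_offset]);
}

// instantiated once per format, e.g.:
// dequantize_block<32, 2, dequantize_q4_0><<<num_blocks, block_size, 0, stream>>>(vx, y, k);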
I did some quick testing:
There is no need for templates, actually, @JohannesGaessler. I converted your new CUDA code to macros and it still works fine. Horrible macros:
#define DEFINE_Q_FUNCS(NAM, block_size, qk, qr, dequantize_kernel) \
static __global__ void dequantize_block_##NAM(const void * vx, float * y, const int k) { \
const int i = blockDim.x*blockIdx.x + 2*threadIdx.x; \
\
if (i >= k) { \
return; \
} \
\
const int ib = i/qk; /* block index */ \
const int iqs = (i%qk)/qr; /* quant index */ \
const int iybs = i - i%qk; /* y block start index */ \
const int y_offset = qr == 1 ? 1 : qk/2; \
\
/* dequantize */ \
float & v0 = y[iybs + iqs + 0]; \
float & v1 = y[iybs + iqs + y_offset]; \
dequantize_kernel(vx, ib, iqs, v0, v1); \
} \
\
static void dequantize_row_##NAM##_cuda(const void * vx, float * y, const int k, cudaStream_t stream) { \
const int num_blocks = (k + CUDA_DEQUANTIZE_BLOCK_SIZE - 1) / CUDA_DEQUANTIZE_BLOCK_SIZE; \
dequantize_block_##NAM<<<num_blocks, CUDA_DEQUANTIZE_BLOCK_SIZE, 0, stream>>>(vx, y, k); \
} \
\
static __global__ void dequantize_mul_mat_vec_##NAM(const void * vx, const float * y, float * dst, const int ncols) { \
const int row = blockIdx.x; \
const int tid = threadIdx.x; \
\
const int y_offset = qr == 1 ? 1 : qk/2; \
\
__shared__ float tmp[block_size]; /* separate sum for each thread */ \
tmp[tid] = 0; \
\
for (int i = 0; i < ncols/block_size; i += 2) { \
const int col = i*block_size + 2*tid; \
const int ib = (row*ncols + col)/qk; /* block index */ \
const int iqs = (col%qk)/qr; /* quant index */ \
const int iybs = col - col%qk; /* y block start index */ \
\
/* dequantize */ \
float v0, v1; \
dequantize_kernel(vx, ib, iqs, v0, v1); \
\
/* matrix multiplication */ \
tmp[tid] += v0 * y[iybs + iqs + 0]; \
tmp[tid] += v1 * y[iybs + iqs + y_offset]; \
}\
\
/* sum up partial sums and write back result */ \
__syncthreads(); \
for (int s=block_size/2; s>0; s>>=1) { \
if (tid < s) { \
tmp[tid] += tmp[tid + s]; \
} \
__syncthreads(); \
} \
if (tid == 0) { \
dst[row] = tmp[0]; \
} \
} \
\
static void dequantize_mul_mat_vec_##NAM##_cuda(const void * vx, const float * y, float * dst, const int ncols, const int nrows, cudaStream_t stream) { \
GGML_ASSERT(ncols % CUDA_DMMV_BLOCK_SIZE == 0); \
dequantize_mul_mat_vec_##NAM<<<nrows, CUDA_DMMV_BLOCK_SIZE, 0, stream>>>(vx, y, dst, ncols); \
}
DEFINE_Q_FUNCS(q4_0, CUDA_DMMV_BLOCK_SIZE, QK4_0, QR4_0, dequantize_q4_0)
DEFINE_Q_FUNCS(q4_1, CUDA_DMMV_BLOCK_SIZE, QK4_1, QR4_1, dequantize_q4_1)
DEFINE_Q_FUNCS(q5_0, CUDA_DMMV_BLOCK_SIZE, QK5_0, QR5_0, dequantize_q5_0)
DEFINE_Q_FUNCS(q5_1, CUDA_DMMV_BLOCK_SIZE, QK5_1, QR5_1, dequantize_q5_1)
DEFINE_Q_FUNCS(q8_0, CUDA_DMMV_BLOCK_SIZE, QK8_0, QR8_0, dequantize_q8_0)
DEFINE_Q_FUNCS(f16, CUDA_DMMV_BLOCK_SIZE, 32, 1, convert_f16)
Currently this wouldn't work in ggml-opencl.c because of the way the string is processed into the file. There can't be any defines in there.
The runtime compilation of OpenCL gives us the ability to build the kernels on the fly. I plan to add a simple templater that does this to reduce duplication.
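To make the "simple templater" idea concrete, here is one way it could look. This is a hypothetical sketch, not the code planned for ggml-opencl.c; the placeholder tokens and the helper name are made up. The idea is to keep one kernel source template and substitute the per-quantization pieces before handing the string to the OpenCL compiler.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* hypothetical helper: returns a malloc'd copy of tmpl with every occurrence of key replaced by val */
static char * template_replace(const char * tmpl, const char * key, const char * val) {
    const size_t klen = strlen(key);
    const size_t vlen = strlen(val);
    size_t cap = strlen(tmpl) + 1;
    /* count occurrences to size the output buffer */
    for (const char * p = strstr(tmpl, key); p != NULL; p = strstr(p + klen, key)) {
        if (vlen > klen) cap += vlen - klen;
    }
    char * out = malloc(cap);
    char * dst = out;
    const char * src = tmpl;
    for (const char * p = strstr(src, key); p != NULL; p = strstr(src, key)) {
        memcpy(dst, src, (size_t)(p - src)); dst += p - src;
        memcpy(dst, val, vlen);              dst += vlen;
        src = p + klen;
    }
    strcpy(dst, src);
    return out;
}

/* one shared kernel template; KERNEL_NAME and DEQUANT_BODY are made-up placeholder tokens */
static const char * dequant_template =
    "__kernel void KERNEL_NAME(__global const void * x, __global float * y, const int k) {\n"
    "    const int i = get_global_id(0);\n"
    "    if (i >= k) return;\n"
    "    DEQUANT_BODY\n"
    "}\n";

int main(void) {
    char * step1 = template_replace(dequant_template, "KERNEL_NAME", "dequantize_block_q4_0");
    char * src   = template_replace(step1, "DEQUANT_BODY", "/* q4_0-specific unpacking goes here */");
    printf("%s", src); /* this string would then go to clCreateProgramWithSource + clBuildProgram */
    free(step1);
    free(src);
    return 0;
}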
Some tests with (GPU: RX 6600 8GB, CPU: 5600X, RAM: 32GB 3600MHz). Edit: model is Wizard-Vicuna-13B-Uncensored.ggml.q5_1.bin. There is an error in CMakeLists.txt.

OPENCL - Ubuntu
Initializing CLBlast (First Run)...
system_info: n_threads = 6 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
llama_print_timings: load time = 5548,81 ms

ROCM - CUDA
llama_model_load_internal: [cublas] offloading 31 layers to GPU
system_info: n_threads = 6 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
llama_print_timings: load time = 4835,65 ms

OPENCL - WINDOWS
Initializing CLBlast (First Run)...
system_info: n_threads = 6 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
llama_print_timings: load time = 9369.12 ms
@LiliumSancta, which LLaMA is it? BTW, add
Sorry, forgot to mention: it is Wizard-Vicuna-13B-Uncensored.ggml.q5_1.bin
@LiliumSancta Thanks for the test. Your results also show that something is wrong with token processing. I will try to investigate that.
I would like to thank all of you for your efforts in the development of this project. Later, when I get off work, I'll redo everything with a 7B model that can be loaded entirely into VRAM, using --ignore-eos to be more consistent.
@0cc4m On WizardLM 7B, offloading all 32 layers, I'm getting 60 ms per token on a 5700XT. On the wizard-vicuna-13B model I'm only able to offload 27 layers before saturating the VRAM, and I get 157 ms per token. Both models are q5_1 quantized.
wizardlm-7B-q5_1:
wizard-vicuna-uncensored-13B-q5_1:
I don't think so, necessarily; with only 14 tokens the timing may be very inaccurate. But it does show two things: that OpenCL can be as fast as CUDA/ROCm, and that CLBlast is seriously underwhelming compared to cuBLAS/rocBLAS.
Yes, maybe I sent too many layers to the GPU and I believe that could be affecting processing times, so I intend to redo it with a smaller model. I noticed that CLBlast allows me to send more layers than fit on the GPU without throwing any errors, but it slows everything down a lot.
The changes compiled fine and the program runs fine, but '--n-gpu-layers N' seems to not make a difference. I checked nvidia-smi and there is no VRAM usage difference when I change the number of layers, and it seems no faster. I am using an 'NVIDIA GeForce RTX 3050 Ti Laptop GPU'.
Prompt times without --gpu-layers:
With:
Maybe I did something wrong?
Results of OpenAssistant-SFT-7-Llama-30B-q5_1 with Radeon 5700XT (8GB) - i7 9700K 32GB RAM (3200MHz)
0 layers:
15 layers:
More than 15 layers saturates the VRAM and slows everything down a lot.
Inside Windows 11 it's fast, but under Linux it's crazy fast.
model llama 7B Q4_0 -n 512 --ctx-size 2048 --gpu_layers 99 --seed 1684109798 -t 6

Windows - OPENCL
short prompt: llama_print_timings: load time = 6061.47 ms
long prompt: llama_print_timings: load time = 11210.65 ms

Ubuntu - OPENCL
short prompt: llama_print_timings: load time = 4629,94 ms
long prompt: llama_print_timings: load time = 7222,00 ms

Ubuntu - CUDA - ROCM
short prompt: llama_print_timings: load time = 2580,17 ms
long prompt: llama_print_timings: load time = 5641,55 ms
clang-tidy made some suggestions
Tried this out and I have some findings to share:
Adding some debug printing to the ggml_cl_mul_mat_q_f32 function shows that it executes 4 times before silently crashing. The program exits with no errors and no asserts (even with debug on). To solve that, I backported the cl_sgemm_wrapper from the previous implementation, and I modified ggml_cl_can_mul_mat to trigger ONLY for tensors on device, regardless of batch size. That allowed me to run everything with no issues. These are my benchmarks:
I am really enjoying the practically 100% speed improvement from this, excluding the prompt processing regressions, which can in the worst case be solved by using it only for batchsize=1 tensors on device.
Can confirm the latest fix solves the crash. Is anyone else able to repro the performance regression in the prompt processing? My speed seems equally fast.
I implemented more CUDA kernels that I will need to fix the memory management issues: #1483. They do not make a difference for performance. My plan is to directly load the LLaMA parameters from disk to VRAM, and if all parameters (including the norms) are loaded this way, it will be simpler.
I think it is ready for review now. I tested on Nvidia and AMD and it worked in all cases for me. I have noticed that there are (weaker) devices that are slightly slower when prompt processing with the non-contiguous tensor loading (like my Steam Deck), but I haven't found the reason for that yet.
But #1435 should definitely be merged first. I'll adapt this PR once that happens.
I'll get on it.
More benchmarks: AMD Accelerated Parallel Processing Device gfx1030 with FP16 off:
Also, is anyone else having issues with the q8_0 dequantize kernel? I am getting gibberish, but only for that kernel, and only on CLBlast. OpenBLAS works fine, as does CLBlast on other kernels. The implementation looks different from the CUDA one, but unfortunately I don't understand it enough to review it.
@SlyEcho I rebased this branch onto your changes; I hope I got everything right. Let me know if you see any mistake.
Awesome! I only have one nitpick with the printf.
OK, I found something: using an F16 model with
That's an FP16 issue again... will take a look tomorrow.
@SlyEcho I found the bug. Please check again if it's fixed for you as well. What nitpick did you have with the printf?
clang-tidy made some suggestions
Otherwise it seems to work.
@SlyEcho We also noticed that selecting platform and device by ID wasn't working as it did before: the platform number did nothing anymore, as only the device number picked the device out of the whole list of devices. I restored the previous behavior for this case, so that both numbers are relevant. I hope that's okay with you; Koboldcpp was already relying on it. If you disagree with this, or if you find a mistake, let me know.
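For reference, roughly what "both numbers are relevant" means: the platform index selects an OpenCL platform and the device index selects a device within that platform. The environment variable names mirror the ones the project documents, but the code below is a simplified sketch with bounds and error checks omitted, not the actual selection logic in ggml-opencl.c.

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

int main(void) {
    /* read the two indices; both default to 0 */
    const char * p = getenv("GGML_OPENCL_PLATFORM");
    const char * d = getenv("GGML_OPENCL_DEVICE");
    const int plat_num = p ? atoi(p) : 0;
    const int dev_num  = d ? atoi(d) : 0;

    /* the platform number picks one platform out of all installed ones */
    cl_uint num_platforms = 0;
    clGetPlatformIDs(0, NULL, &num_platforms);
    cl_platform_id * platforms = malloc(num_platforms * sizeof(cl_platform_id));
    clGetPlatformIDs(num_platforms, platforms, NULL);

    /* the device number is local to the chosen platform, not an index into a global device list */
    cl_uint num_devices = 0;
    clGetDeviceIDs(platforms[plat_num], CL_DEVICE_TYPE_ALL, 0, NULL, &num_devices);
    cl_device_id * devices = malloc(num_devices * sizeof(cl_device_id));
    clGetDeviceIDs(platforms[plat_num], CL_DEVICE_TYPE_ALL, num_devices, devices, NULL);

    char name[256];
    clGetDeviceInfo(devices[dev_num], CL_DEVICE_NAME, sizeof(name), name, NULL);
    printf("Using platform %d, device %d: %s\n", plat_num, dev_num, name);

    free(platforms);
    free(devices);
    return 0;
}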
clang-tidy made some suggestions
I hope you don't mind me just pushing changes. I feel it will go faster this way.
Thanks, I didn't even see that... Apparently it was too late for the logic side of my brain.
The device selection logic is good; it's more logical this way. I just didn't figure it out myself.
I think this PR is ready.
Let's see what happens; I think it's a big improvement anyway.
I refactored the OpenCL implementation to become more like the CUDA implementation, including dequant_mul_mat kernels and GPU layers. I see a pretty good speed increase with it.
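For anyone curious what such a dequant_mul_mat kernel looks like on the OpenCL side, here is a rough q4_0 matrix-vector sketch mirroring the CUDA version quoted earlier in the thread. The struct layout (float scale plus 16 packed nibbles), the fixed work-group size of 32, and the kernel name are assumptions for illustration, not the exact kernels in this PR.

#define QK4_0      32
#define BLOCK_SIZE 32   /* assumed work-group size; the kernel is launched with local size 32 */

struct block_q4_0 {
    float d;              /* scale (assumed layout) */
    uchar qs[QK4_0 / 2];  /* 32 4-bit quants packed into 16 bytes */
};

__kernel void dequantize_mul_mat_vec_q4_0(
        __global const struct block_q4_0 * x,
        __global const float * y,
        __global float * dst,
        const int ncols) {
    const int row = get_group_id(0);   /* one work-group per matrix row */
    const int tid = get_local_id(0);

    __local float tmp[BLOCK_SIZE];     /* one partial sum per work-item */
    tmp[tid] = 0.0f;

    for (int i = 0; i < ncols/BLOCK_SIZE; i += 2) {
        const int col  = i*BLOCK_SIZE + 2*tid;
        const int ib   = (row*ncols + col)/QK4_0;   /* block index */
        const int iqs  = (col%QK4_0)/2;             /* quant index */
        const int iybs = col - col%QK4_0;           /* y block start index */

        /* dequantize two values from one packed byte */
        const float d  = x[ib].d;
        const uchar q  = x[ib].qs[iqs];
        const float v0 = ((q & 0xF) - 8) * d;
        const float v1 = ((q >>  4) - 8) * d;

        /* multiply-accumulate against the activation vector */
        tmp[tid] += v0 * y[iybs + iqs + 0];
        tmp[tid] += v1 * y[iybs + iqs + QK4_0/2];
    }

    /* reduce the partial sums within the work-group */
    barrier(CLK_LOCAL_MEM_FENCE);
    for (int s = BLOCK_SIZE/2; s > 0; s >>= 1) {
        if (tid < s) {
            tmp[tid] += tmp[tid + s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (tid == 0) {
        dst[row] = tmp[0];
    }
}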
Not done testing yet, so I put this PR on draft until I'm fully done. But since I've seen others with the same idea, I wanna put it up already to hopefully avoid doing the same work multiple times.
Here are some initial results on Ryzen 9 5950X and AMD RX 6800 XT, q5_1 Llama 7B:
master branch:
This PR (no layers offloaded):
This PR (33 layers offloaded, 4.7GB VRAM used, dequant_mul_mat kernels used):
The slight regression in prompt eval time is the reason I didn't follow the CUDA implementation sooner; it appeared when I implemented the non-contiguous tensor loading. There might be some bottleneck in the code related to that, so let me know if you spot something.
Still missing:
CMakeLists.txt update
Nvidia tests
Intel tests
iGPU tests