
CUDA: Enable K-shift operation for -ctk q8_0 (limited) #9571

Merged: 1 commit into ggerganov:master, Sep 24, 2024

Conversation

@Nekotekina (Contributor) commented Sep 20, 2024

This is a reworked #5653. Some CUDA code was adapted from 3d92acf.
The original PR had an explosive GPU memory requirement; I'm not sure whether that is a bug or the intended logic of the allocator. I worked around it by reusing the same tensor, and it seems to work well for me.
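
For readers unfamiliar with what a K-shift on a quantized cache implies, here is a minimal, hedged sketch of the per-block round trip: dequantize the q8_0 data, apply a RoPE rotation for the position delta, and quantize back in place. This is an illustration only, not the ggml implementation; the block layout is simplified (a float scale instead of ggml's fp16 scale), the rotation uses the plain consecutive-pair RoPE convention, and it is applied over a single 32-value block for brevity, whereas the real graph rotates whole K rows.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

constexpr int QK8_0 = 32;                 // values per q8_0 block, as in ggml

struct block_q8_0_ish {                   // simplified stand-in for ggml's block_q8_0
    float  d;                             // per-block scale (ggml stores this as fp16)
    int8_t qs[QK8_0];                     // quantized values
};

static void dequantize(const block_q8_0_ish & b, float * out) {
    for (int i = 0; i < QK8_0; ++i) out[i] = b.d * b.qs[i];
}

static void quantize(const float * in, block_q8_0_ish & b) {
    float amax = 0.0f;
    for (int i = 0; i < QK8_0; ++i) amax = std::max(amax, std::fabs(in[i]));
    b.d = amax / 127.0f;
    const float id = b.d != 0.0f ? 1.0f / b.d : 0.0f;
    for (int i = 0; i < QK8_0; ++i) b.qs[i] = (int8_t) std::lround(in[i] * id);
}

// Rotate consecutive pairs by "shift" extra positions (simplified RoPE).
static void rope_shift(float * x, int n, int shift, float theta_base = 10000.0f) {
    for (int i = 0; i < n; i += 2) {
        const float theta = shift * std::pow(theta_base, -(float) i / n);
        const float c = std::cos(theta), s = std::sin(theta);
        const float x0 = x[i], x1 = x[i + 1];
        x[i]     = x0 * c - x1 * s;
        x[i + 1] = x0 * s + x1 * c;
    }
}

// In-place round trip: mirrors the CPY -> ROPE -> CPY node chain visible in the
// scheduler dump below, reduced to a single block on the CPU.
void k_shift_block(block_q8_0_ish & b, int shift) {
    float tmp[QK8_0];
    dequantize(b, tmp);
    rope_shift(tmp, QK8_0, shift);
    quantize(tmp, b);
}
```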

The github-actions bot added the Nvidia GPU label (Issues specific to Nvidia GPUs) on Sep 20, 2024
@slaren (Collaborator) commented Sep 21, 2024

I cannot reproduce the memory-usage issue that you reported; in my tests the allocator correctly reuses the memory of the previous tensors automatically. The ggml_backend_sched_set_tensor_backend call is necessary, but the other changes should be reverted to keep the code simple.

It would also be very desirable to implement this in the CPU backend, and add a test in test-backend-ops for the new copy op.

For cases where the input and output of the copy are contiguous (as they are here), this could also be implemented using the existing dequantize functions in the CUDA backend, which would allow it to work with any format and likely with better performance.

## SPLIT #0: CUDA0 # 1 inputs: [K_shift (   0K)]
node #  1 (       CPY):              K_f32-0 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]              K_f32-0 (  32K) [CUDA0         ]
node #  2 (      ROPE):       K_f32-0 (view) (  32K) [CUDA0         ]:              K_f32-0 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node #  3 (       CPY):          K_shifted-0 (   8K) [CUDA0         ]:       K_f32-0 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node #  5 (       CPY):              K_f32-1 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]              K_f32-1 (  32K) [CUDA0         ]
node #  6 (      ROPE):       K_f32-1 (view) (  32K) [CUDA0         ]:              K_f32-1 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node #  7 (       CPY):          K_shifted-1 (   8K) [CUDA0         ]:       K_f32-1 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node #  9 (       CPY):              K_f32-2 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]              K_f32-2 (  32K) [CUDA0         ]
node # 10 (      ROPE):       K_f32-2 (view) (  32K) [CUDA0         ]:              K_f32-2 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 11 (       CPY):          K_shifted-2 (   8K) [CUDA0         ]:       K_f32-2 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 13 (       CPY):              K_f32-3 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]              K_f32-3 (  32K) [CUDA0         ]
node # 14 (      ROPE):       K_f32-3 (view) (  32K) [CUDA0         ]:              K_f32-3 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 15 (       CPY):          K_shifted-3 (   8K) [CUDA0         ]:       K_f32-3 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 17 (       CPY):              K_f32-4 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]              K_f32-4 (  32K) [CUDA0         ]
node # 18 (      ROPE):       K_f32-4 (view) (  32K) [CUDA0         ]:              K_f32-4 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 19 (       CPY):          K_shifted-4 (   8K) [CUDA0         ]:       K_f32-4 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 21 (       CPY):              K_f32-5 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]              K_f32-5 (  32K) [CUDA0         ]
node # 22 (      ROPE):       K_f32-5 (view) (  32K) [CUDA0         ]:              K_f32-5 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 23 (       CPY):          K_shifted-5 (   8K) [CUDA0         ]:       K_f32-5 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 25 (       CPY):              K_f32-6 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]              K_f32-6 (  32K) [CUDA0         ]
node # 26 (      ROPE):       K_f32-6 (view) (  32K) [CUDA0         ]:              K_f32-6 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 27 (       CPY):          K_shifted-6 (   8K) [CUDA0         ]:       K_f32-6 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 29 (       CPY):              K_f32-7 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]              K_f32-7 (  32K) [CUDA0         ]
node # 30 (      ROPE):       K_f32-7 (view) (  32K) [CUDA0         ]:              K_f32-7 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 31 (       CPY):          K_shifted-7 (   8K) [CUDA0         ]:       K_f32-7 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 33 (       CPY):              K_f32-8 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]              K_f32-8 (  32K) [CUDA0         ]
node # 34 (      ROPE):       K_f32-8 (view) (  32K) [CUDA0         ]:              K_f32-8 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 35 (       CPY):          K_shifted-8 (   8K) [CUDA0         ]:       K_f32-8 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 37 (       CPY):              K_f32-9 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]              K_f32-9 (  32K) [CUDA0         ]
node # 38 (      ROPE):       K_f32-9 (view) (  32K) [CUDA0         ]:              K_f32-9 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 39 (       CPY):          K_shifted-9 (   8K) [CUDA0         ]:       K_f32-9 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 41 (       CPY):             K_f32-10 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]             K_f32-10 (  32K) [CUDA0         ]
node # 42 (      ROPE):      K_f32-10 (view) (  32K) [CUDA0         ]:             K_f32-10 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 43 (       CPY):         K_shifted-10 (   8K) [CUDA0         ]:      K_f32-10 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 45 (       CPY):             K_f32-11 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]             K_f32-11 (  32K) [CUDA0         ]
node # 46 (      ROPE):      K_f32-11 (view) (  32K) [CUDA0         ]:             K_f32-11 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 47 (       CPY):         K_shifted-11 (   8K) [CUDA0         ]:      K_f32-11 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 49 (       CPY):             K_f32-12 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]             K_f32-12 (  32K) [CUDA0         ]
node # 50 (      ROPE):      K_f32-12 (view) (  32K) [CUDA0         ]:             K_f32-12 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 51 (       CPY):         K_shifted-12 (   8K) [CUDA0         ]:      K_f32-12 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 53 (       CPY):             K_f32-13 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]             K_f32-13 (  32K) [CUDA0         ]
node # 54 (      ROPE):      K_f32-13 (view) (  32K) [CUDA0         ]:             K_f32-13 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 55 (       CPY):         K_shifted-13 (   8K) [CUDA0         ]:      K_f32-13 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 57 (       CPY):             K_f32-14 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]             K_f32-14 (  32K) [CUDA0         ]
node # 58 (      ROPE):      K_f32-14 (view) (  32K) [CUDA0         ]:             K_f32-14 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 59 (       CPY):         K_shifted-14 (   8K) [CUDA0         ]:      K_f32-14 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 61 (       CPY):             K_f32-15 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]             K_f32-15 (  32K) [CUDA0         ]
node # 62 (      ROPE):      K_f32-15 (view) (  32K) [CUDA0         ]:             K_f32-15 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 63 (       CPY):         K_shifted-15 (   8K) [CUDA0         ]:      K_f32-15 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 65 (       CPY):             K_f32-16 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]             K_f32-16 (  32K) [CUDA0         ]
node # 66 (      ROPE):      K_f32-16 (view) (  32K) [CUDA0         ]:             K_f32-16 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 67 (       CPY):         K_shifted-16 (   8K) [CUDA0         ]:      K_f32-16 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]

## SPLIT #1: CUDA1 # 1 inputs: [K_shift (   0K)]
node # 69 (       CPY):             K_f32-17 (  32K) [CUDA1         ]:               K-view (   8K) [CUDA1         ]             K_f32-17 (  32K) [CUDA1         ]
node # 70 (      ROPE):      K_f32-17 (view) (  32K) [CUDA1         ]:             K_f32-17 (  32K) [CUDA1         ]      CUDA1#K_shift#0 (   0K) [ NULL         ]
node # 71 (       CPY):         K_shifted-17 (   8K) [CUDA1         ]:      K_f32-17 (view) (  32K) [CUDA1         ]               K-view (   8K) [CUDA1         ]
node # 73 (       CPY):             K_f32-18 (  32K) [CUDA1         ]:               K-view (   8K) [CUDA1         ]             K_f32-18 (  32K) [CUDA1         ]
node # 74 (      ROPE):      K_f32-18 (view) (  32K) [CUDA1         ]:             K_f32-18 (  32K) [CUDA1         ]      CUDA1#K_shift#0 (   0K) [ NULL         ]
node # 75 (       CPY):         K_shifted-18 (   8K) [CUDA1         ]:      K_f32-18 (view) (  32K) [CUDA1         ]               K-view (   8K) [CUDA1         ]
node # 77 (       CPY):             K_f32-19 (  32K) [CUDA1         ]:               K-view (   8K) [CUDA1         ]             K_f32-19 (  32K) [CUDA1         ]
node # 78 (      ROPE):      K_f32-19 (view) (  32K) [CUDA1         ]:             K_f32-19 (  32K) [CUDA1         ]      CUDA1#K_shift#0 (   0K) [ NULL         ]
node # 79 (       CPY):         K_shifted-19 (   8K) [CUDA1         ]:      K_f32-19 (view) (  32K) [CUDA1         ]               K-view (   8K) [CUDA1         ]
node # 81 (       CPY):             K_f32-20 (  32K) [CUDA1         ]:               K-view (   8K) [CUDA1         ]             K_f32-20 (  32K) [CUDA1         ]
node # 82 (      ROPE):      K_f32-20 (view) (  32K) [CUDA1         ]:             K_f32-20 (  32K) [CUDA1         ]      CUDA1#K_shift#0 (   0K) [ NULL         ]
node # 83 (       CPY):         K_shifted-20 (   8K) [CUDA1         ]:      K_f32-20 (view) (  32K) [CUDA1         ]               K-view (   8K) [CUDA1         ]
node # 85 (       CPY):             K_f32-21 (  32K) [CUDA1         ]:               K-view (   8K) [CUDA1         ]             K_f32-21 (  32K) [CUDA1         ]
node # 86 (      ROPE):      K_f32-21 (view) (  32K) [CUDA1         ]:             K_f32-21 (  32K) [CUDA1         ]      CUDA1#K_shift#0 (   0K) [ NULL         ]
node # 87 (       CPY):         K_shifted-21 (   8K) [CUDA1         ]:      K_f32-21 (view) (  32K) [CUDA1         ]               K-view (   8K) [CUDA1         ]
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 1)
max_size = 0.00 MB: tensors: K_shift [0-80] (0.00 MB)
max_size = 0.00 MB: tensors: CUDA0#K_shift#0 [0-80] (0.00 MB)
max_size = 0.03 MB: tensors: CUDA0#K_shift#0 [0-80] (0.00 MB) K_f32-0 [80-8080] (0.03 MB)
max_size = 0.00 MB: tensors: CUDA1#K_shift#0 [0-80] (0.00 MB)
max_size = 0.03 MB: tensors: CUDA1#K_shift#0 [0-80] (0.00 MB) K_f32-17 [80-8080] (0.03 MB)
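
To make the dequantize-function suggestion above concrete, here is a rough sketch of how a contiguous GGML_OP_CPY from a quantized source could dispatch to the converters in convert.cu. It is a sketch against ggml-cuda internals, not the code that was merged: it assumes the ggml and ggml-cuda headers are available (including cuda_fp16.h for half), and the converter call form follows the ggml_get_to_fp16_cuda usage quoted later in this thread (source pointer, destination pointer, element count, stream). The helper name ggml_cpy_dequant_cuda and its exact parameters are hypothetical.

```cpp
// Hypothetical helper: dequantizing copy for contiguous tensors only.
static void ggml_cpy_dequant_cuda(
        const ggml_tensor * src0,    // quantized source, e.g. a q8_0 K-cache view
        ggml_tensor       * dst,     // f32 or f16 destination
        const char        * src0_d,  // device pointer of src0
        char              * dst_d,   // device pointer of dst
        cudaStream_t        stream) {
    // A flat element-order conversion is only valid when both sides are contiguous.
    GGML_ASSERT(ggml_is_contiguous(src0) && ggml_is_contiguous(dst));

    const int64_t n = ggml_nelements(dst);
    if (dst->type == GGML_TYPE_F32) {
        ggml_get_to_fp32_cuda(src0->type)(src0_d, (float *) dst_d, n, stream);
    } else if (dst->type == GGML_TYPE_F16) {
        ggml_get_to_fp16_cuda(src0->type)(src0_d, (half *) dst_d, n, stream);
    } else {
        GGML_ABORT("unsupported destination type for a dequantizing copy");
    }
}
```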

@Nekotekina (Contributor, Author):

Right, I simplified it, thanks. But where are the dequantize functions you are talking about? Do you mean ggml_get_to_fp32_cuda from convert.cu?

@Nekotekina (Contributor, Author):

I also forgot to ask: is there a reason to use f32 over f16? The performance of the q8_0 K-shift doesn't seem great in this PR, and I wonder whether using f16 could improve it.

@slaren (Collaborator) commented Sep 22, 2024

Yes, I mean the functions from convert.cu; it should be straightforward to use these for GGML_OP_CPY when both src0 and src1 are contiguous. Using F16 should also work, and at least it would reduce the buffer size, which is always good, but I wouldn't expect a big performance difference. I think the quantization kernels could be optimized to use more threads (one thread per value instead of per block), which should improve performance.
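
As an illustration of the "one thread per value" idea mentioned here, below is a hedged CUDA sketch of a q8_0 quantization kernel that assigns one warp per 32-value block and finds the per-block absolute maximum with warp shuffles. It is not the kernel that llama.cpp ships; the block struct is simplified (float scale instead of fp16), and it assumes the element count is a multiple of 32 and the thread-block size is a multiple of the warp size, so no warp straddles two q8_0 blocks.

```cuda
#include <cuda_fp16.h>
#include <cstdint>

#define QK8_0 32

struct block_q8_0_ish {
    float  d;            // per-block scale (ggml uses fp16 here)
    int8_t qs[QK8_0];    // quantized values
};

__global__ void quantize_q8_0_one_thread_per_value(
        const float * __restrict__ x, block_q8_0_ish * __restrict__ y, int64_t n) {
    const int64_t i = (int64_t) blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;   // safe: with n % 32 == 0, whole warps exit together

    const int   lane = threadIdx.x % QK8_0;  // position inside the q8_0 block
    const float v    = x[i];

    // Warp-wide reduction of |v| over the 32 lanes that share one q8_0 block.
    float amax = fabsf(v);
    #pragma unroll
    for (int offset = QK8_0 / 2; offset > 0; offset >>= 1) {
        amax = fmaxf(amax, __shfl_xor_sync(0xffffffff, amax, offset));
    }

    const float d  = amax / 127.0f;
    const float id = d != 0.0f ? 1.0f / d : 0.0f;

    block_q8_0_ish * blk = &y[i / QK8_0];
    blk->qs[lane] = (int8_t) roundf(v * id);
    if (lane == 0) {
        blk->d = d;       // one lane per block writes the scale
    }
}

// Launch example (n a multiple of QK8_0, block size a multiple of 32):
// quantize_q8_0_one_thread_per_value<<<(n + 255) / 256, 256, 0, stream>>>(x, y, n);
```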

Commit "llama: enable K-shift for quantized KV cache" (it will fail on unsupported backends or quant types).
@slaren merged commit 116efee into ggerganov:master on Sep 24, 2024; 53 checks passed.
@neavo commented Sep 24, 2024

> This is a reworked #5653 […] It seems to work well for me.

Great, it works well in actual use. Are there plans to support more quantization types, such as q4_0?

@Nekotekina (Contributor, Author):

I'd like to improve it further, as @slaren suggested, but I'm a bit confused by the APIs.

ggml_get_to_fp16_cuda(src0->type)(src0_ddc, reinterpret_cast<half*>(src1_ddc), ggml_nelements(src1), main_stream);

If I use the convert function like this, what should be passed as k? Is the number of f16/f32 elements OK?

@slaren (Collaborator) commented Sep 24, 2024

Yes, that looks right.

@Nekotekina (Contributor, Author) commented Sep 25, 2024

Hmm, ggml_nelements(src1) seems to corrupt the KV cache after all, or maybe I did something else wrong. I'll try to create a PR later.

dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request on Oct 29, 2024: "llama: enable K-shift for quantized KV cache" (it will fail on unsupported backends or quant types).