
CUDA: Enable K-shift operation for -ctk q8_0 (limited) #9571

Merged: 1 commit into ggerganov:master, Sep 24, 2024

Conversation

@Nekotekina (Contributor) commented Sep 20, 2024

This is a reworked #5653. Some CUDA code was adapted from 3d92acf.
The original PR had an explosive GPU memory requirement; I'm not sure whether that is a bug or the intended logic of the allocator. I worked around it by reusing the same tensor, and it seems to work well for me.
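
For readers unfamiliar with what a K-shift on a quantized cache implies, here is a minimal, hedged sketch of the per-block round trip: dequantize the q8_0 data, apply a RoPE rotation for the position delta, and quantize back in place. This is an illustration only, not the ggml implementation; the block layout is simplified (a float scale instead of ggml's fp16 scale), the rotation uses the plain consecutive-pair RoPE convention, and it is applied over a single 32-value block for brevity, whereas the real graph rotates whole K rows.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

constexpr int QK8_0 = 32;                 // values per q8_0 block, as in ggml

struct block_q8_0_ish {                   // simplified stand-in for ggml's block_q8_0
    float  d;                             // per-block scale (ggml stores this as fp16)
    int8_t qs[QK8_0];                     // quantized values
};

static void dequantize(const block_q8_0_ish & b, float * out) {
    for (int i = 0; i < QK8_0; ++i) out[i] = b.d * b.qs[i];
}

static void quantize(const float * in, block_q8_0_ish & b) {
    float amax = 0.0f;
    for (int i = 0; i < QK8_0; ++i) amax = std::max(amax, std::fabs(in[i]));
    b.d = amax / 127.0f;
    const float id = b.d != 0.0f ? 1.0f / b.d : 0.0f;
    for (int i = 0; i < QK8_0; ++i) b.qs[i] = (int8_t) std::lround(in[i] * id);
}

// Rotate consecutive pairs by "shift" extra positions (simplified RoPE).
static void rope_shift(float * x, int n, int shift, float theta_base = 10000.0f) {
    for (int i = 0; i < n; i += 2) {
        const float theta = shift * std::pow(theta_base, -(float) i / n);
        const float c = std::cos(theta), s = std::sin(theta);
        const float x0 = x[i], x1 = x[i + 1];
        x[i]     = x0 * c - x1 * s;
        x[i + 1] = x0 * s + x1 * c;
    }
}

// In-place round trip: mirrors the CPY -> ROPE -> CPY node chain visible in the
// scheduler dump below, reduced to a single block on the CPU.
void k_shift_block(block_q8_0_ish & b, int shift) {
    float tmp[QK8_0];
    dequantize(b, tmp);
    rope_shift(tmp, QK8_0, shift);
    quantize(tmp, b);
}
```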

The github-actions bot added the Nvidia GPU label (Issues specific to Nvidia GPUs) on Sep 20, 2024
@slaren (Collaborator) commented Sep 21, 2024

I cannot reproduce the memory-usage issue that you reported; in my tests the allocator correctly reuses the memory of the previous tensors automatically. The ggml_backend_sched_set_tensor_backend call is necessary, but the other changes should be reverted to keep the code simple.

It would also be very desirable to implement this in the CPU backend, and add a test in test-backend-ops for the new copy op.

For cases where the input and output of the copy are contiguous (as they are here), this could also be implemented using the existing dequantize functions in the CUDA backend, which would allow it to work with any format and likely with better performance.

## SPLIT #0: CUDA0 # 1 inputs: [K_shift (   0K)]
node #  1 (       CPY):              K_f32-0 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]              K_f32-0 (  32K) [CUDA0         ]
node #  2 (      ROPE):       K_f32-0 (view) (  32K) [CUDA0         ]:              K_f32-0 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node #  3 (       CPY):          K_shifted-0 (   8K) [CUDA0         ]:       K_f32-0 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node #  5 (       CPY):              K_f32-1 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]              K_f32-1 (  32K) [CUDA0         ]
node #  6 (      ROPE):       K_f32-1 (view) (  32K) [CUDA0         ]:              K_f32-1 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node #  7 (       CPY):          K_shifted-1 (   8K) [CUDA0         ]:       K_f32-1 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node #  9 (       CPY):              K_f32-2 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]              K_f32-2 (  32K) [CUDA0         ]
node # 10 (      ROPE):       K_f32-2 (view) (  32K) [CUDA0         ]:              K_f32-2 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 11 (       CPY):          K_shifted-2 (   8K) [CUDA0         ]:       K_f32-2 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 13 (       CPY):              K_f32-3 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]              K_f32-3 (  32K) [CUDA0         ]
node # 14 (      ROPE):       K_f32-3 (view) (  32K) [CUDA0         ]:              K_f32-3 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 15 (       CPY):          K_shifted-3 (   8K) [CUDA0         ]:       K_f32-3 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 17 (       CPY):              K_f32-4 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]              K_f32-4 (  32K) [CUDA0         ]
node # 18 (      ROPE):       K_f32-4 (view) (  32K) [CUDA0         ]:              K_f32-4 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 19 (       CPY):          K_shifted-4 (   8K) [CUDA0         ]:       K_f32-4 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 21 (       CPY):              K_f32-5 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]              K_f32-5 (  32K) [CUDA0         ]
node # 22 (      ROPE):       K_f32-5 (view) (  32K) [CUDA0         ]:              K_f32-5 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 23 (       CPY):          K_shifted-5 (   8K) [CUDA0         ]:       K_f32-5 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 25 (       CPY):              K_f32-6 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]              K_f32-6 (  32K) [CUDA0         ]
node # 26 (      ROPE):       K_f32-6 (view) (  32K) [CUDA0         ]:              K_f32-6 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 27 (       CPY):          K_shifted-6 (   8K) [CUDA0         ]:       K_f32-6 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 29 (       CPY):              K_f32-7 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]              K_f32-7 (  32K) [CUDA0         ]
node # 30 (      ROPE):       K_f32-7 (view) (  32K) [CUDA0         ]:              K_f32-7 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 31 (       CPY):          K_shifted-7 (   8K) [CUDA0         ]:       K_f32-7 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 33 (       CPY):              K_f32-8 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]              K_f32-8 (  32K) [CUDA0         ]
node # 34 (      ROPE):       K_f32-8 (view) (  32K) [CUDA0         ]:              K_f32-8 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 35 (       CPY):          K_shifted-8 (   8K) [CUDA0         ]:       K_f32-8 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 37 (       CPY):              K_f32-9 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]              K_f32-9 (  32K) [CUDA0         ]
node # 38 (      ROPE):       K_f32-9 (view) (  32K) [CUDA0         ]:              K_f32-9 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 39 (       CPY):          K_shifted-9 (   8K) [CUDA0         ]:       K_f32-9 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 41 (       CPY):             K_f32-10 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]             K_f32-10 (  32K) [CUDA0         ]
node # 42 (      ROPE):      K_f32-10 (view) (  32K) [CUDA0         ]:             K_f32-10 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 43 (       CPY):         K_shifted-10 (   8K) [CUDA0         ]:      K_f32-10 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 45 (       CPY):             K_f32-11 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]             K_f32-11 (  32K) [CUDA0         ]
node # 46 (      ROPE):      K_f32-11 (view) (  32K) [CUDA0         ]:             K_f32-11 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 47 (       CPY):         K_shifted-11 (   8K) [CUDA0         ]:      K_f32-11 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 49 (       CPY):             K_f32-12 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]             K_f32-12 (  32K) [CUDA0         ]
node # 50 (      ROPE):      K_f32-12 (view) (  32K) [CUDA0         ]:             K_f32-12 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 51 (       CPY):         K_shifted-12 (   8K) [CUDA0         ]:      K_f32-12 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 53 (       CPY):             K_f32-13 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]             K_f32-13 (  32K) [CUDA0         ]
node # 54 (      ROPE):      K_f32-13 (view) (  32K) [CUDA0         ]:             K_f32-13 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 55 (       CPY):         K_shifted-13 (   8K) [CUDA0         ]:      K_f32-13 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 57 (       CPY):             K_f32-14 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]             K_f32-14 (  32K) [CUDA0         ]
node # 58 (      ROPE):      K_f32-14 (view) (  32K) [CUDA0         ]:             K_f32-14 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 59 (       CPY):         K_shifted-14 (   8K) [CUDA0         ]:      K_f32-14 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 61 (       CPY):             K_f32-15 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]             K_f32-15 (  32K) [CUDA0         ]
node # 62 (      ROPE):      K_f32-15 (view) (  32K) [CUDA0         ]:             K_f32-15 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 63 (       CPY):         K_shifted-15 (   8K) [CUDA0         ]:      K_f32-15 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]
node # 65 (       CPY):             K_f32-16 (  32K) [CUDA0         ]:               K-view (   8K) [CUDA0         ]             K_f32-16 (  32K) [CUDA0         ]
node # 66 (      ROPE):      K_f32-16 (view) (  32K) [CUDA0         ]:             K_f32-16 (  32K) [CUDA0         ]      CUDA0#K_shift#0 (   0K) [ NULL         ]
node # 67 (       CPY):         K_shifted-16 (   8K) [CUDA0         ]:      K_f32-16 (view) (  32K) [CUDA0         ]               K-view (   8K) [CUDA0         ]

## SPLIT #1: CUDA1 # 1 inputs: [K_shift (   0K)]
node # 69 (       CPY):             K_f32-17 (  32K) [CUDA1         ]:               K-view (   8K) [CUDA1         ]             K_f32-17 (  32K) [CUDA1         ]
node # 70 (      ROPE):      K_f32-17 (view) (  32K) [CUDA1         ]:             K_f32-17 (  32K) [CUDA1         ]      CUDA1#K_shift#0 (   0K) [ NULL         ]
node # 71 (       CPY):         K_shifted-17 (   8K) [CUDA1         ]:      K_f32-17 (view) (  32K) [CUDA1         ]               K-view (   8K) [CUDA1         ]
node # 73 (       CPY):             K_f32-18 (  32K) [CUDA1         ]:               K-view (   8K) [CUDA1         ]             K_f32-18 (  32K) [CUDA1         ]
node # 74 (      ROPE):      K_f32-18 (view) (  32K) [CUDA1         ]:             K_f32-18 (  32K) [CUDA1         ]      CUDA1#K_shift#0 (   0K) [ NULL         ]
node # 75 (       CPY):         K_shifted-18 (   8K) [CUDA1         ]:      K_f32-18 (view) (  32K) [CUDA1         ]               K-view (   8K) [CUDA1         ]
node # 77 (       CPY):             K_f32-19 (  32K) [CUDA1         ]:               K-view (   8K) [CUDA1         ]             K_f32-19 (  32K) [CUDA1         ]
node # 78 (      ROPE):      K_f32-19 (view) (  32K) [CUDA1         ]:             K_f32-19 (  32K) [CUDA1         ]      CUDA1#K_shift#0 (   0K) [ NULL         ]
node # 79 (       CPY):         K_shifted-19 (   8K) [CUDA1         ]:      K_f32-19 (view) (  32K) [CUDA1         ]               K-view (   8K) [CUDA1         ]
node # 81 (       CPY):             K_f32-20 (  32K) [CUDA1         ]:               K-view (   8K) [CUDA1         ]             K_f32-20 (  32K) [CUDA1         ]
node # 82 (      ROPE):      K_f32-20 (view) (  32K) [CUDA1         ]:             K_f32-20 (  32K) [CUDA1         ]      CUDA1#K_shift#0 (   0K) [ NULL         ]
node # 83 (       CPY):         K_shifted-20 (   8K) [CUDA1         ]:      K_f32-20 (view) (  32K) [CUDA1         ]               K-view (   8K) [CUDA1         ]
node # 85 (       CPY):             K_f32-21 (  32K) [CUDA1         ]:               K-view (   8K) [CUDA1         ]             K_f32-21 (  32K) [CUDA1         ]
node # 86 (      ROPE):      K_f32-21 (view) (  32K) [CUDA1         ]:             K_f32-21 (  32K) [CUDA1         ]      CUDA1#K_shift#0 (   0K) [ NULL         ]
node # 87 (       CPY):         K_shifted-21 (   8K) [CUDA1         ]:      K_f32-21 (view) (  32K) [CUDA1         ]               K-view (   8K) [CUDA1         ]
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 1)
max_size = 0.00 MB: tensors: K_shift [0-80] (0.00 MB)
max_size = 0.00 MB: tensors: CUDA0#K_shift#0 [0-80] (0.00 MB)
max_size = 0.03 MB: tensors: CUDA0#K_shift#0 [0-80] (0.00 MB) K_f32-0 [80-8080] (0.03 MB)
max_size = 0.00 MB: tensors: CUDA1#K_shift#0 [0-80] (0.00 MB)
max_size = 0.03 MB: tensors: CUDA1#K_shift#0 [0-80] (0.00 MB) K_f32-17 [80-8080] (0.03 MB)
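
To make the dequantize-function suggestion above concrete, here is a rough sketch of how a contiguous GGML_OP_CPY from a quantized source could dispatch to the converters in convert.cu. It is a sketch against ggml-cuda internals, not the code that was merged: it assumes the ggml and ggml-cuda headers are available (including cuda_fp16.h for half), and the converter call form follows the ggml_get_to_fp16_cuda usage quoted later in this thread (source pointer, destination pointer, element count, stream). The helper name ggml_cpy_dequant_cuda and its exact parameters are hypothetical.

```cpp
// Hypothetical helper: dequantizing copy for contiguous tensors only.
static void ggml_cpy_dequant_cuda(
        const ggml_tensor * src0,    // quantized source, e.g. a q8_0 K-cache view
        ggml_tensor       * dst,     // f32 or f16 destination
        const char        * src0_d,  // device pointer of src0
        char              * dst_d,   // device pointer of dst
        cudaStream_t        stream) {
    // A flat element-order conversion is only valid when both sides are contiguous.
    GGML_ASSERT(ggml_is_contiguous(src0) && ggml_is_contiguous(dst));

    const int64_t n = ggml_nelements(dst);
    if (dst->type == GGML_TYPE_F32) {
        ggml_get_to_fp32_cuda(src0->type)(src0_d, (float *) dst_d, n, stream);
    } else if (dst->type == GGML_TYPE_F16) {
        ggml_get_to_fp16_cuda(src0->type)(src0_d, (half *) dst_d, n, stream);
    } else {
        GGML_ABORT("unsupported destination type for a dequantizing copy");
    }
}
```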

@Nekotekina (Contributor, Author):

Right, I simplified it, thanks. But where are the dequantize functions you are talking about? Do you mean ggml_get_to_fp32_cuda from convert.cu?

@Nekotekina (Contributor, Author):

I also forgot to ask: is there a reason to use f32 over f16? The performance of the q8_0 K-shift doesn't seem great in this PR, and I wonder whether using f16 could improve it.

@slaren (Collaborator) commented Sep 22, 2024

Yes, I mean the functions from convert.cu; it should be straightforward to use these for GGML_OP_CPY when both src0 and src1 are contiguous. Using F16 should also work, and at least it would reduce the buffer size, which is always good, but I wouldn't expect a big performance difference. I think the quantization kernels could be optimized to use more threads (one thread per value instead of per block), which should improve performance.
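
As an illustration of the "one thread per value" idea mentioned here, below is a hedged CUDA sketch of a q8_0 quantization kernel that assigns one warp per 32-value block and finds the per-block absolute maximum with warp shuffles. It is not the kernel that llama.cpp ships; the block struct is simplified (float scale instead of fp16), and it assumes the element count is a multiple of 32 and the thread-block size is a multiple of the warp size, so no warp straddles two q8_0 blocks.

```cuda
#include <cuda_fp16.h>
#include <cstdint>

#define QK8_0 32

struct block_q8_0_ish {
    float  d;            // per-block scale (ggml uses fp16 here)
    int8_t qs[QK8_0];    // quantized values
};

__global__ void quantize_q8_0_one_thread_per_value(
        const float * __restrict__ x, block_q8_0_ish * __restrict__ y, int64_t n) {
    const int64_t i = (int64_t) blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;   // safe: with n % 32 == 0, whole warps exit together

    const int   lane = threadIdx.x % QK8_0;  // position inside the q8_0 block
    const float v    = x[i];

    // Warp-wide reduction of |v| over the 32 lanes that share one q8_0 block.
    float amax = fabsf(v);
    #pragma unroll
    for (int offset = QK8_0 / 2; offset > 0; offset >>= 1) {
        amax = fmaxf(amax, __shfl_xor_sync(0xffffffff, amax, offset));
    }

    const float d  = amax / 127.0f;
    const float id = d != 0.0f ? 1.0f / d : 0.0f;

    block_q8_0_ish * blk = &y[i / QK8_0];
    blk->qs[lane] = (int8_t) roundf(v * id);
    if (lane == 0) {
        blk->d = d;       // one lane per block writes the scale
    }
}

// Launch example (n a multiple of QK8_0, block size a multiple of 32):
// quantize_q8_0_one_thread_per_value<<<(n + 255) / 256, 256, 0, stream>>>(x, y, n);
```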

Commit "llama: enable K-shift for quantized KV cache" (it will fail on unsupported backends or quant types).
@slaren merged commit 116efee into ggerganov:master on Sep 24, 2024; 53 checks passed.
@neavo commented Sep 24, 2024

> This is a reworked #5653 […] It seems to work well for me.

Great, it works well in actual use. Are there plans to support more quantization types, such as q4_0?

@Nekotekina (Contributor, Author):

I'd like to improve it further, as @slaren suggested, but I'm a bit confused by the APIs.

ggml_get_to_fp16_cuda(src0->type)(src0_ddc, reinterpret_cast<half*>(src1_ddc), ggml_nelements(src1), main_stream);

If I use the convert function like this, what should be passed as k? Is the number of f16/f32 elements OK?

@slaren (Collaborator) commented Sep 24, 2024

Yes, that looks right.

@Nekotekina (Contributor, Author) commented Sep 25, 2024

Hmm, ggml_nelements(src1) seems to corrupt the KV cache after all, or maybe I did something else wrong. I'll try to create a PR later.

dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request on Oct 29, 2024: "llama: enable K-shift for quantized KV cache" (it will fail on unsupported backends or quant types).