ggml : extend ggml_mul_mat to support non-F32 input for parameter b
#455
Comments
I guess this is the reason why 1D tensors need to be stored as F32 in the model files.
This also seems to be needed for PR ggerganov/llama.cpp#2632.
Not exactly - it is a similar problem, but it is related to …
We could add a flag to the context and perform the conversion automatically when building the graph:

```c
// If enabled, convert b to the type that the matmul kernel for a's type
// actually dots against (its vec_dot_type) by inserting an explicit copy
// node into the graph.
if (ctx->quantize_matmul && b->type != type_traits[a->type].vec_dot_type) {
    struct ggml_tensor * tmp_b = ggml_new_tensor(ctx, type_traits[a->type].vec_dot_type, b->n_dims, b->ne);
    b = ggml_cpy(ctx, b, tmp_b);
}
```

The disadvantage is that it may add a bit of overhead by increasing the size of the graphs, but it shouldn't be too bad. To avoid this, we could set a flag in …
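For context, `vec_dot_type` is the type the matmul kernel dots `b` in: for the Q4-family weight types it is Q8_0/Q8_1, and for F16 weights it is F16. Below is a minimal sketch of the same conversion done explicitly at graph-build time with the existing public API, assuming `a` is a Q4_0 weight (so the target type is Q8_0); note that the final `ggml_mul_mat` call with a quantized `b` is precisely what this issue proposes to allow:

```c
// Explicit, graph-level version of the automatic conversion sketched above.
// Assumes `a` is a Q4_0 weight matrix, whose vec_dot_type is Q8_0.
struct ggml_tensor * b_q = ggml_new_tensor_2d(ctx, GGML_TYPE_Q8_0, b->ne[0], b->ne[1]);
b_q = ggml_cpy(ctx, b, b_q);                          // F32 -> Q8_0 conversion node
struct ggml_tensor * c  = ggml_mul_mat(ctx, a, b_q);  // requires this issue's extension
```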
Somewhat related to this, it would be good to have a …
Currently, we always pass `b` to `ggml_mul_mat` as F32 and internally quantize it depending on the type of `a`. There is no option that allows passing an already quantized `b`. The primary goal of this task is to add such an option.

For more info, see: ggerganov/llama.cpp#2615 (comment)
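To make the current behavior concrete, here is a hedged sketch of today's usage versus what this task would enable. It assumes an initialized `struct ggml_context * ctx`; the tensor names and dimensions are purely illustrative:

```c
// Illustrative dimensions; not taken from any real model.
const int64_t n_embd = 4096, n_out = 4096, n_tokens = 32;

// Today: b must be F32; the matmul quantizes it internally to a's
// vec_dot_type each time the op is evaluated.
struct ggml_tensor * w     = ggml_new_tensor_2d(ctx, GGML_TYPE_Q4_0, n_embd, n_out);
struct ggml_tensor * cur   = ggml_new_tensor_2d(ctx, GGML_TYPE_F32,  n_embd, n_tokens);
struct ggml_tensor * out   = ggml_mul_mat(ctx, w, cur);   // works today

// Proposed: allow passing a b that is already quantized (e.g. Q8_0),
// skipping the internal F32 -> Q8_0 conversion.
struct ggml_tensor * cur_q = ggml_new_tensor_2d(ctx, GGML_TYPE_Q8_0, n_embd, n_tokens);
struct ggml_tensor * out_q = ggml_mul_mat(ctx, w, cur_q); // this issue's proposal
```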
The primary focus will be `ggml_mul_mat`, but we can also think about a more general approach for the rest of the operators. For example, `ggml_mul` currently also works with just F32 input, which prevents having 1D F16 norm tensors. This is not a huge drawback since these tensors are usually small, but it would be nice to also support F16.

Additionally, we can extend `ggml` with parameters that control the implicit quantizations, i.e. disable / enable / change types, etc. This is a secondary objective, and it is not 100% clear how it would work from an API POV.
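One possible shape for such parameters, sketched purely hypothetically (none of these symbols exist in ggml; this only illustrates the API question):

```c
// Hypothetical knob for the secondary objective: control how ggml converts b
// before a matmul. None of these names exist in ggml today.
enum ggml_implicit_quant {
    GGML_IMPLICIT_QUANT_DEFAULT, // convert b to the vec_dot_type of a (current behavior)
    GGML_IMPLICIT_QUANT_NONE,    // require b to already be in a supported type
    GGML_IMPLICIT_QUANT_F16,     // force an F16 intermediate instead of the default
};

void ggml_set_implicit_quant(struct ggml_context * ctx, enum ggml_implicit_quant mode);
```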