Currently, we always pass `b` to `ggml_mul_mat` as F32 and internally quantize it depending on the type of `a`. There is no option that allows passing an already quantized `b`. The primary goal of this task is to add such an option.
For more info, see: ggml-org/llama.cpp#2615 (comment)
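
A rough sketch of the desired usage (hypothetical: the shapes, the Q8_0 choice for `b`, and the relaxed type check on `b` are all assumptions, since `ggml_mul_mat` currently expects `b` in F32):

```c
// Hypothetical usage once the option exists: `b` is created directly in the
// type that ggml_mul_mat would otherwise quantize to internally (e.g. Q8_0
// when `a` is Q4_0), so the implicit F32 -> Q8_0 conversion is skipped.
struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_Q4_0, 4096, 4096);
struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_Q8_0, 4096, 32); // pre-quantized
struct ggml_tensor * c = ggml_mul_mat(ctx, a, b); // today this requires `b` to be F32
```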
The primary focus will be `ggml_mul_mat`, but we can also think about a more general approach for the rest of the operators. For example, `ggml_mul` currently also works only with F32 input, which prevents having 1D F16 norm tensors. This is not a huge drawback since these tensors are usually small, but it would be nice to also support F16.
Additionally, we can extend `ggml` with parameters that control the implicit quantizations, i.e. disable / enable / change types, etc. This is a secondary objective, and it is not 100% clear how it would work from an API POV.
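
Purely as a strawman for that API discussion (nothing below exists in `ggml`; the enum, function name, and per-context scoping are all assumptions):

```c
// Speculative sketch of a knob controlling the implicit quantization of `b`
// in ggml_mul_mat; none of this is part of the current ggml API.
enum ggml_implicit_quant {
    GGML_IMPLICIT_QUANT_DEFAULT,  // current behavior: type chosen based on `a`
    GGML_IMPLICIT_QUANT_DISABLED, // require `b` to already be in the expected type
    GGML_IMPLICIT_QUANT_F16,      // force conversion of `b` to F16 instead
};

// Would apply to all subsequent ggml_mul_mat calls in this context (hypothetical).
void ggml_set_implicit_quant(struct ggml_context * ctx, enum ggml_implicit_quant mode);
```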