Currently, we always pass `b` to `ggml_mul_mat` as F32 and internally quantize it depending on the type of `a`. There is no option that allows passing an already quantized `b`. The primary goal of this task is to add such an option.
For more info, see: ggml-org/llama.cpp#2615 (comment)
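
A rough sketch of the desired usage (hypothetical: the shapes, the Q8_0 choice for `b`, and the relaxed type check on `b` are all assumptions, since `ggml_mul_mat` currently expects `b` in F32):

```c
// Hypothetical usage once the option exists: `b` is created directly in the
// type that ggml_mul_mat would otherwise quantize to internally (e.g. Q8_0
// when `a` is Q4_0), so the implicit F32 -> Q8_0 conversion is skipped.
struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_Q4_0, 4096, 4096);
struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_Q8_0, 4096, 32); // pre-quantized
struct ggml_tensor * c = ggml_mul_mat(ctx, a, b); // today this requires `b` to be F32
```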
The primary focus will be `ggml_mul_mat`, but we can also think about a more general approach for the rest of the operators. For example, `ggml_mul` currently also works only with F32 input, which prevents having 1D F16 norm tensors. This is not a huge drawback since these tensors are usually small, but it would be nice to also support F16.
Additionally, we can extend `ggml` with parameters that control the implicit quantizations, i.e. disable / enable / change types, etc. This is a secondary objective, and it is not 100% clear how it would work from an API POV.
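
Purely as a strawman for that API discussion (nothing below exists in `ggml`; the enum, function name, and per-context scoping are all assumptions):

```c
// Speculative sketch of a knob controlling the implicit quantization of `b`
// in ggml_mul_mat; none of this is part of the current ggml API.
enum ggml_implicit_quant {
    GGML_IMPLICIT_QUANT_DEFAULT,  // current behavior: type chosen based on `a`
    GGML_IMPLICIT_QUANT_DISABLED, // require `b` to already be in the expected type
    GGML_IMPLICIT_QUANT_F16,      // force conversion of `b` to F16 instead
};

// Would apply to all subsequent ggml_mul_mat calls in this context (hypothetical).
void ggml_set_implicit_quant(struct ggml_context * ctx, enum ggml_implicit_quant mode);
```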