Replies: 1 comment
-
I wouldn't expect it to make a difference, but maybe there is some issue with the way batched matrix multiplications are scheduled to threads. It would be useful if you can reproduce this by adding a perf test case in
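(Not the referenced perf test case, just an illustration.) Below is a minimal standalone timing sketch of the kind of reproduction being asked for: it builds the same batched `ggml_mul_mat()` once with a `Q8_0` weight stored as 2D and viewed as 3D, and once with the same shape allocated natively as 3D, and times each on the CPU. The shapes (`K`, `M`, `H`, `T`), thread count and context size are hypothetical placeholders, and it assumes the single-context CPU path (`ggml_graph_compute_with_ctx()`, which newer ggml trees declare in `ggml-cpu.h` rather than `ggml.h`).

```c
#include "ggml.h"
#include <stdio.h>

// NOTE: in newer ggml trees ggml_graph_compute_with_ctx() lives in ggml-cpu.h.

static int64_t time_mul_mat(struct ggml_context * ctx, struct ggml_tensor * a,
                            struct ggml_tensor * b, int n_threads) {
    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, ggml_mul_mat(ctx, a, b));

    const int64_t t0 = ggml_time_us();
    ggml_graph_compute_with_ctx(ctx, gf, n_threads);
    return ggml_time_us() - t0;
}

int main(void) {
    ggml_time_init();

    struct ggml_init_params params = {
        /*.mem_size   =*/ 256u*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // hypothetical MLA-like shapes: K x M weights per head, H heads, T tokens
    const int64_t K = 512, M = 128, H = 128, T = 64;

    // weights stored as 2D and then re-interpreted as 3D (the problematic pattern) ...
    struct ggml_tensor * w2d  = ggml_new_tensor_2d(ctx, GGML_TYPE_Q8_0, K, M*H);
    struct ggml_tensor * wv3d = ggml_view_3d(ctx, w2d, K, M, H,
                                             w2d->nb[1], w2d->nb[1]*M, 0);
    // ... versus the same shape allocated natively as 3D
    struct ggml_tensor * w3d  = ggml_new_tensor_3d(ctx, GGML_TYPE_Q8_0, K, M, H);

    // activations: one [K, T] slice per head; the contents are irrelevant for
    // timing, so the (uninitialised) data is left as-is
    struct ggml_tensor * x    = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, K, T, H);

    fprintf(stderr, "3D view of 2D tensor: %lld us\n", (long long) time_mul_mat(ctx, wv3d, x, 8));
    fprintf(stderr, "native 3D tensor:     %lld us\n", (long long) time_mul_mat(ctx, w3d,  x, 8));

    ggml_free(ctx);
    return 0;
}
```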
-
I'm trying to get to the bottom of the problems with the `deepseek2-mla` code having horrible performance on quantised tensors, and have got it down to a couple of tensors that are stored as 2D but then viewed as 3D.

These same tensors have no problem if they are stored as `F16`, `BF16` or `F32`, but as `Q8_0` or any other quant they completely tank when they get used in a non-broadcasted batch matrix multiplication.

Am I running into some memory alignment problem here, and would storing the same tensors as 3D in the GGUF align the 2D dimension to a better boundary compared to the `ggml_view_3d()` call? (I'm trying this now, but it will take several hours to requant the model to use 3D for the problematic tensors.)

If not, then is there any way these can be aligned/padded to help with this?
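For reference, here is a small sketch (hypothetical row length, not taken from the model) of why the strides differ between types: the byte stride of a row, which a `ggml_view_3d()` over a 2D tensor inherits as its `nb1`/`nb2`, is `ggml_row_size(type, ne0)`. For `F16`/`F32` that is a power-of-two multiple of the element size, while for `Q8_0` it is 34 bytes per 32-element block, so the per-head slices of the view start at much less "round" byte offsets.

```c
// Sketch: row strides for a hypothetical row length of 512 elements.
#include "ggml.h"
#include <stdio.h>

int main(void) {
    const int64_t ne0 = 512; // hypothetical row length

    // nb1 of a 2D tensor (and of a ggml_view_3d() over it) is the row size:
    printf("F32  row: %zu bytes\n", ggml_row_size(GGML_TYPE_F32,  ne0)); // 2048
    printf("F16  row: %zu bytes\n", ggml_row_size(GGML_TYPE_F16,  ne0)); // 1024
    printf("Q8_0 row: %zu bytes\n", ggml_row_size(GGML_TYPE_Q8_0, ne0)); // 544 = 16 blocks * 34 B
    return 0;
}
```

Whether that alignment difference is actually what hurts the batched path is exactly the open question above; the view itself is valid either way, since it only re-describes the same contiguous data.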