Closed
Description
Currently, we store separate tensors for each expert.
This leads to a large number of possible "source" tensors for the _id
ops, which significantly increases the size of struct ggml_tensor
on the stack.
Additionally, the Metal implementation is currently hacked to support up to 8 experts, and extending it beyond that is not straightforward.
We should improve this. One possible approach is to store the data for all experts in a single tensor and address it with appropriate offsets.
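
To make the single-tensor idea concrete, here is a minimal sketch in plain C. It does not use the ggml API: `struct packed_experts`, `expert_offset_bytes` and `expert_ptr` are hypothetical names invented for illustration, and the sketch assumes every expert has the same shape and element type (float).

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical container (not a ggml type): all expert matrices are
// packed back to back in one contiguous buffer.
struct packed_experts {
    float *data;      // n_expert * rows * cols floats
    size_t n_expert;  // number of experts stored in the buffer
    size_t rows;      // rows per expert matrix
    size_t cols;      // columns per expert matrix
};

// Byte offset of expert `id`: the per-expert stride multiplied by the id.
static size_t expert_offset_bytes(const struct packed_experts *pe, size_t id) {
    return id * pe->rows * pe->cols * sizeof(float);
}

// Pointer to the first element of expert `id` inside the shared buffer.
static float *expert_ptr(const struct packed_experts *pe, size_t id) {
    assert(id < pe->n_expert); // guard against out-of-range expert ids
    return (float *)((char *)pe->data + expert_offset_bytes(pe, id));
}
```

With this layout an op needs a single source pointer plus an expert id, regardless of how many experts there are, so the number of experts no longer affects the number of source tensors.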