We have our model converted to GGUF with quantization, shout out to @teleprint-me and @ds5t5.
But it's still slow, and the problem is the prompt: prefill speed is about 500 tps (Apple M1), which is way too slow for practical use. For fill-in-the-middle code completion, the user would have to wait 4 seconds for a typical 2000-token context.
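For scale, a quick back-of-the-envelope on prefill latency (a minimal sketch; the 500 tps figure is the measurement above, and the assumption that throughput stays roughly flat with context length is ours):

```python
# Back-of-the-envelope prefill latency, assuming throughput stays
# roughly constant with context length (an approximation; attention
# cost actually grows with context).
PREFILL_TPS = 500  # measured on Apple M1 (see above)

for context_tokens in (512, 1000, 2000, 4000):
    latency_s = context_tokens / PREFILL_TPS
    print(f"{context_tokens:>5} tokens -> {latency_s:.1f} s before the first completion token")
# 2000 tokens -> 4.0 s, which matches the 4-second wait above
```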
We train our own models, so the question is: what if we change the architecture? What is the bottleneck for prefill, and how do we make it 5-10x faster, short of simply making the network smaller?
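To get a feel for where the bottleneck is, here is a rough compute-bound check (a sketch only: the model size and the M1 peak-FLOPS figure below are illustrative assumptions, not numbers from this issue):

```python
# Rough check on whether prefill is compute-bound, using the standard
# ~2 * params FLOPs-per-token estimate for a transformer forward pass.
# All constants below are assumptions for illustration, not measurements.
PARAMS = 1.5e9            # hypothetical model size
M1_PEAK_FLOPS = 2.6e12    # approximate Apple M1 GPU peak (FP32)
MEASURED_TPS = 500        # prefill throughput from the issue

flops_per_token = 2 * PARAMS
achieved_flops = flops_per_token * MEASURED_TPS  # FLOPs per second during prefill
print(f"achieved: {achieved_flops / 1e12:.2f} TFLOPS "
      f"({achieved_flops / M1_PEAK_FLOPS:.0%} of assumed peak)")
```

If the achieved fraction of peak is already high, better kernels alone won't buy 5-10x; the FLOPs per token have to come down, which points back at the architecture question.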