
Architecture: what if I want to optimize for llama.cpp? #3390

Closed
@olegklimov

Description


We have converted our model to GGUF with quantization; shout-out to @teleprint-me and @ds5t5.

But it's still slow; our problem is prompt processing. Prefill runs at about 500 tps (Apple M1), which is way too slow for practical use. For fill-in-the-middle code completion, the user would have to wait 4 seconds for a typical 2000-token context.

We train our own models, so the question is: what if we change the architecture? What is the bottleneck for prefill? How do we make it 5-10x faster, besides making the network smaller?
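
For reference, a minimal sketch of the latency arithmetic behind these numbers (the 500 tps, 2000-token, and 5-10x figures are quoted from the text above; nothing here is measured):

```python
# Back-of-the-envelope prefill latency using the figures quoted in this issue.
context_tokens = 2000   # typical fill-in-the-middle completion context
prefill_tps = 500       # observed prompt-processing speed on Apple M1

latency_s = context_tokens / prefill_tps
print(f"current prefill latency: {latency_s:.1f} s")  # ~4.0 s

# A 5-10x speedup would mean roughly 2500-5000 tps,
# bringing the wait down to about 0.4-0.8 s.
for speedup in (5, 10):
    print(f"{speedup}x faster -> {latency_s / speedup:.2f} s "
          f"({prefill_tps * speedup} tps)")
```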
