
Architecture: what if I want to optimize for llama.cpp? #3390

Closed
@olegklimov

Description


We have converted our model to GGUF with quantization; shout-out to @teleprint-me and @ds5t5.

But it's still slow; our problem is prompt processing. Prefill runs at about 500 tps (Apple M1), which is way too slow for practical use. For fill-in-the-middle code completion, the user would have to wait 4 seconds for a typical 2000-token context.

We train our own models, so the question is: what if we change the architecture? What is the bottleneck for prefill? How do we make it 5-10x faster, besides making the network smaller?
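
For reference, a minimal sketch of the latency arithmetic behind these numbers (the 500 tps, 2000-token, and 5-10x figures are quoted from the text above; nothing here is measured):

```python
# Back-of-the-envelope prefill latency using the figures quoted in this issue.
context_tokens = 2000   # typical fill-in-the-middle completion context
prefill_tps = 500       # observed prompt-processing speed on Apple M1

latency_s = context_tokens / prefill_tps
print(f"current prefill latency: {latency_s:.1f} s")  # ~4.0 s

# A 5-10x speedup would mean roughly 2500-5000 tps,
# bringing the wait down to about 0.4-0.8 s.
for speedup in (5, 10):
    print(f"{speedup}x faster -> {latency_s / speedup:.2f} s "
          f"({prefill_tps * speedup} tps)")
```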
