
Flaky server responses with llama 3 #6785

Closed
@kurnevsky

Description

I noticed that some of the responses I get from the llama.cpp server (latest master) are unnaturally fast for a 70B model, and it happens randomly. When this happens, the response quality is worse. The model I'm using is https://huggingface.co/NousResearch/Meta-Llama-3-70B-Instruct-GGUF/blob/main/Meta-Llama-3-70B-Instruct-Q5_K_M.gguf, launched with the command line llama-server -m Meta-Llama-3-70B-Instruct-Q5_K_M.gguf -c 0 -t 24 -ngl 24. The model is only partially offloaded to the GPU (via ROCm on Linux), so perhaps llama.cpp somehow skips some of the layers when it responds quickly.
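
To make the flakiness measurable, here is a minimal sketch that sends the same prompt repeatedly and logs the generation speed of each response; the address, endpoint, and timings field names are assumptions based on recent llama.cpp server builds (default 127.0.0.1:8080, POST /completion returning a "timings" object), so adjust as needed. The unnaturally fast runs should stand out in the tok/s column:

    import json
    import urllib.request

    # Assumed default llama-server address; change if --host/--port were set.
    URL = "http://127.0.0.1:8080/completion"
    PROMPT = "Explain the difference between a mutex and a semaphore."

    for i in range(10):
        # temperature 0 keeps sampling deterministic, so speed and quality
        # differences between runs point at the server, not the sampler.
        body = json.dumps({
            "prompt": PROMPT,
            "n_predict": 128,
            "temperature": 0,
        }).encode()
        req = urllib.request.Request(
            URL, data=body, headers={"Content-Type": "application/json"}
        )
        with urllib.request.urlopen(req) as resp:
            data = json.loads(resp.read())
        # "timings" is reported by recent builds; field names are assumptions.
        t = data.get("timings", {})
        print(f"run {i}: {t.get('predicted_per_second', '?')} tok/s, "
              f"{t.get('predicted_n', '?')} tokens generated")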
