Is it possible to tell the llama.cpp server to cache prompts when using the v1/chat/completions endpoint?

I have a CLI interface I created for fiction authors that accesses the OpenAI endpoints, and I want to enable it to access local models via the llama.cpp server. I've got it working now, but responses are very slow because the server re-evaluates the entire accumulated prompt on each request. I see that the /completions endpoint supports a cache_prompt flag, but I don't see an equivalent for the v1/chat/completions endpoint.
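
For context, here's roughly what the two kinds of requests look like from my CLI (a minimal sketch using Python requests; the server address, prompt text, and every payload field other than cache_prompt are placeholder assumptions, not my actual code):

```python
import requests

BASE = "http://127.0.0.1:8080"  # assumed default llama.cpp server address; adjust as needed

# Raw completion endpoint: this is where I see the cache_prompt flag documented.
completion_payload = {
    "prompt": "Once upon a time,",   # placeholder prompt
    "n_predict": 128,
    "cache_prompt": True,            # reuse the already-evaluated prompt prefix between requests
}
r = requests.post(f"{BASE}/completions", json=completion_payload)
print(r.json()["content"])

# OpenAI-compatible chat endpoint: I can't find an equivalent flag to pass here.
chat_payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful writing assistant."},
        {"role": "user", "content": "Continue the story."},
    ],
}
r = requests.post(f"{BASE}/v1/chat/completions", json=chat_payload)
print(r.json()["choices"][0]["message"]["content"])
```

If v1/chat/completions would accept the same extra field (or caches prompts by default), that would solve my problem, but I haven't found anything in the docs confirming it.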