Is it possible to tell the llama.cpp server to cache prompts when using the v1/chat/completions endpoint?
I have a CLI interface I created for fiction authors that accesses the OpenAI endpoints, and I want to enable it to access local models via the llama.cpp server. I've got it working, but responses are very slow because the entire accumulated prompt is re-evaluated on each request. I see that the `/completions` endpoint supports a cache flag, but I don't see one for the `v1/chat/completions` endpoint.
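For reference, this is roughly the kind of call I mean on the native endpoint. It's a minimal sketch assuming a llama.cpp server listening on localhost:8080 and the `cache_prompt` field accepted by `/completion`; the URL, port, and prompt are just placeholders:

```python
import requests

# Sketch: call the llama.cpp server's native /completion endpoint with
# prompt caching enabled. Host/port and prompt text are placeholders.
resp = requests.post(
    "http://127.0.0.1:8080/completion",
    json={
        "prompt": "Once upon a time",  # accumulated story context
        "n_predict": 256,              # tokens to generate
        "cache_prompt": True,          # reuse the KV cache for the shared prefix
    },
)
print(resp.json()["content"])
```

I'm looking for an equivalent way to get this prefix reuse when the request goes through the OpenAI-compatible `v1/chat/completions` route instead.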