Cache when using v1/chat/completions? #4287

Closed
@Michael-F-Ellis

Description

Is it possible to tell the llama.cpp server to cache prompts when using the v1/chat/completions endpoint?

I have a CLI interface, created for fiction authors, that accesses the OpenAI endpoints. I want to enable it to access local models via the llama.cpp server. I've got it working now, but responses are very slow because the server re-evaluates the entire accumulated prompt with each request. I see that the /completions endpoint supports a cache flag, but I don't see one for the v1/chat/completions endpoint.
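
For concreteness, here is a rough sketch of the kind of request I mean on the /completion side, where (if I'm reading the server README right) the cache flag is the `cache_prompt` parameter. The host, port, prompt, and `n_predict` values below are just placeholders from my local setup, not a working example of what I'm asking about:

```python
# Sketch only: a raw llama.cpp server /completion request with prompt caching on.
# The URL and prompt are placeholders for my local setup.
import requests

resp = requests.post(
    "http://localhost:8080/completion",  # llama.cpp server's native endpoint
    json={
        "prompt": "<accumulated story context plus the latest instruction>",
        "n_predict": 256,
        "cache_prompt": True,  # reuse the KV cache for the shared prompt prefix
    },
    timeout=600,
)
print(resp.json()["content"])
```

What I'd like is an equivalent of that `cache_prompt` behavior when I send the same accumulated context as messages to v1/chat/completions.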
