Since the server is one of the goals / highlights of this project, I'm planning to move it into a subpackage, e.g. `llama-cpp-python[server]` or something like that.
Work that needs to be done first:
- Ensure compatibility with OpenAI
  - Response objects match
  - Request objects match
  - Loaded model appears under the `/v1/models` endpoint
  - Test OpenAI client libraries (see the client sketch after this list)
  - Unsupported parameters should be silently ignored (see the request-model sketch after this list)
- Ease-of-use
  - Integrate the server as a subpackage
  - CLI tool to run the server (see the entry-point sketch after this list)
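For the compatibility checks, something like the following could exercise an OpenAI client library against the local server. This is a minimal sketch, assuming the server listens on `http://localhost:8000/v1` and using the official `openai` Python client (v1+); the base URL, dummy API key, and prompt are placeholders, not anything the server mandates.

```python
from openai import OpenAI

# Point the official OpenAI client at the local server.
# Base URL and API key are placeholders; the local server
# would not need a real key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-placeholder")

# The loaded model should appear under /v1/models.
models = client.models.list()
print([m.id for m in models.data])

# Response objects should match OpenAI's schema, so the client's
# accessors work unchanged.
completion = client.chat.completions.create(
    model=models.data[0].id,
    messages=[{"role": "user", "content": "Say hello."}],
)
print(completion.choices[0].message.content)
```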
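One way to get "unsupported parameters are silently ignored" is at the request-model level. A sketch assuming the server's request bodies are pydantic models; the class name and field subset here are illustrative, not the server's actual schema:

```python
from pydantic import BaseModel, ConfigDict

class CreateCompletionRequest(BaseModel):
    # Illustrative subset of OpenAI's completion parameters.
    model: str
    prompt: str
    max_tokens: int = 16
    temperature: float = 0.8

    # Silently drop any parameters the server does not support.
    # (extra="ignore" is pydantic's default, but stating it makes
    # the compatibility contract explicit.)
    model_config = ConfigDict(extra="ignore")

# A request carrying an unsupported OpenAI parameter still parses cleanly.
req = CreateCompletionRequest(model="local", prompt="hi", logit_bias={"50256": -100})
print(req.model_dump())  # logit_bias is dropped, not an error
```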
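And for the CLI tool, a rough sketch of what a console entry point might look like. The `create_app` factory and the module layout are assumptions for illustration, not existing APIs:

```python
# Hypothetical sketch of a console entry point for the server.
import argparse
import uvicorn

def main() -> None:
    parser = argparse.ArgumentParser(description="OpenAI-compatible llama.cpp server")
    parser.add_argument("--model", required=True, help="Path to the ggml model file")
    parser.add_argument("--host", default="localhost")
    parser.add_argument("--port", type=int, default=8000)
    args = parser.parse_args()

    from llama_cpp.server.app import create_app  # assumed module layout
    app = create_app(model_path=args.model)      # assumed factory signature

    uvicorn.run(app, host=args.host, port=args.port)

if __name__ == "__main__":
    main()
```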
Future work:
- Prompt caching to improve latency (see the caching sketch below)
- Support multiple models in the same server
- Add tokenization endpoints to make it easier for small clients to calculate context window sizes (see the endpoint sketch below)
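On prompt caching, the rough idea is to keep the evaluated key/value state for a shared prompt prefix so repeated requests skip re-evaluating it. A sketch assuming a `LlamaCache`-style API on the `Llama` object; the names and model path here may not match what actually ships:

```python
from llama_cpp import Llama, LlamaCache

llm = Llama(model_path="./models/7B/ggml-model.bin")  # placeholder path
llm.set_cache(LlamaCache())  # keep kv state for previously seen prefixes

system = "You are a helpful assistant.\n"

# The first call pays the full prompt-evaluation cost...
llm(system + "Q: What is 2+2?\nA:", max_tokens=16)
# ...later calls sharing the prefix can reuse the cached state.
llm(system + "Q: What is 3+3?\nA:", max_tokens=16)
```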
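For the tokenization endpoints, a minimal FastAPI sketch. The `/v1/tokenize` path and response shape are hypothetical (this is not an OpenAI endpoint), and the model path is a placeholder:

```python
from fastapi import FastAPI
from pydantic import BaseModel
from llama_cpp import Llama

app = FastAPI()
llm = Llama(model_path="./models/7B/ggml-model.bin")  # placeholder path

class TokenizeRequest(BaseModel):
    text: str

@app.post("/v1/tokenize")  # hypothetical path, not part of the OpenAI API
def tokenize(req: TokenizeRequest):
    tokens = llm.tokenize(req.text.encode("utf-8"))
    # Returning the count lets small clients budget the context window
    # without bundling a tokenizer themselves.
    return {"tokens": tokens, "count": len(tokens)}
```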