Run some tests and choose an OpenAI-API-compatible local LLM server #7
Comments
It could also be worth trying a Modal deployment of the LLM server; see the sketch below.
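A minimal sketch of what such a Modal deployment could look like, assuming Modal's current `App` / `web_server` API, vLLM's OpenAI-compatible entrypoint, and the `mistralai/Mixtral-8x7B-Instruct-v0.1` weights; the GPU type, image contents, and app name are placeholders, not a tested configuration.

```python
# Hypothetical sketch: expose vLLM's OpenAI-compatible API from Modal.
# GPU sizing, model name and Modal API details are assumptions, not tested.
import subprocess

import modal

# Container image with vLLM installed (no version pin, illustrative only).
image = modal.Image.debian_slim().pip_install("vllm")

app = modal.App("mixtral-openai-server", image=image)


@app.function(gpu="A100", timeout=60 * 60)  # GPU choice is a placeholder
@modal.web_server(8000)
def serve():
    # Start the OpenAI-compatible server in the background;
    # Modal forwards port 8000 to a public URL.
    subprocess.Popen(
        [
            "python", "-m", "vllm.entrypoints.openai.api_server",
            "--model", "mistralai/Mixtral-8x7B-Instruct-v0.1",
            "--host", "0.0.0.0",
            "--port", "8000",
        ]
    )
```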
Interesting PRs for vLLM with respect to speculative decoding (vllm-project/vllm#2188) and fused MoE kernels (vllm-project/vllm#2913, vllm-project/vllm#2979).
The language model will be Mixtral. The server must support structured extraction using Pydantic. The evaluation criterion is the read/write speed of the inference server: specifically, how long it takes to read and write one million tokens when running the same structured extraction task on about a thousand documents in parallel. The same benchmark should also be run on Modal.
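As a concrete reference for the structured-extraction requirement, here is a minimal sketch using the `openai` Python client against any OpenAI-compatible server; the `Invoice` schema, endpoint URL, and model name are placeholders, and the guided-JSON `extra_body` option is a vLLM-specific feature that other servers may ignore.

```python
# Hypothetical sketch: structured extraction with Pydantic against an
# OpenAI-compatible endpoint (URL, model name and schema are placeholders).
from openai import OpenAI
from pydantic import BaseModel


class Invoice(BaseModel):
    """Example target schema for the extraction task."""
    vendor: str
    total: float
    currency: str


client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

document = "ACME Corp invoice. Total due: 1,250.00 EUR."

response = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    messages=[
        {
            "role": "user",
            "content": (
                "Extract the invoice fields as JSON matching this schema:\n"
                f"{Invoice.model_json_schema()}\n\nDocument:\n{document}"
            ),
        }
    ],
    # vLLM's server accepts guided JSON decoding via extra_body; other
    # servers may ignore this and rely on the prompt alone (assumption).
    extra_body={"guided_json": Invoice.model_json_schema()},
)

# Validate the model output against the schema.
invoice = Invoice.model_validate_json(response.choices[0].message.content)
print(invoice)
```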
https://github.com/ollama/ollama
https://github.com/abetlen/llama-cpp-python
https://github.com/vllm-project/vllm
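A rough sketch of the throughput measurement described above: fire the same extraction request over ~1,000 documents concurrently against whichever server is under test and count prompt ("read") and completion ("write") tokens from the response `usage` field. The endpoint, model name, prompt, and concurrency limit are placeholders.

```python
# Hypothetical benchmark sketch: time one request per document, run them in
# parallel, and report prompt/completion token throughput.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
MODEL = "mistralai/Mixtral-8x7B-Instruct-v0.1"
semaphore = asyncio.Semaphore(64)  # cap in-flight requests (placeholder)


async def extract(document: str) -> tuple[int, int]:
    async with semaphore:
        response = await client.chat.completions.create(
            model=MODEL,
            messages=[
                {"role": "user", "content": f"Extract the fields as JSON:\n{document}"}
            ],
        )
    usage = response.usage
    return usage.prompt_tokens, usage.completion_tokens


async def main() -> None:
    documents = [f"document {i} ..." for i in range(1000)]  # stand-in corpus
    start = time.perf_counter()
    counts = await asyncio.gather(*(extract(doc) for doc in documents))
    elapsed = time.perf_counter() - start
    read = sum(p for p, _ in counts)
    written = sum(c for _, c in counts)
    print(f"{read} prompt tokens, {written} completion tokens in {elapsed:.1f}s")
    print(f"~{(read + written) / elapsed:.0f} tokens/s end to end")


if __name__ == "__main__":
    asyncio.run(main())
```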