Efficient, easy-to-use platform for inference and serving of local LLMs, including an OpenAI-compatible API server.
- OpenAI-compatible API server for serving LLMs.
- Highly extensible trait-based system that allows rapid implementation of new model pipelines.
- Streaming support during generation.
- Llama
  - 7b
  - 13b
  - 70b
- Mistral
  - 7b
See this folder for some examples.
In your terminal, install the `openai` Python package by running `pip install openai`.
Then, create a new Python file and write the following code:
```python
import openai

openai.api_key = "EMPTY"
openai.base_url = "http://localhost:2000/v1/"

completion = openai.chat.completions.create(
    model="llama7b",
    messages=[
        {
            "role": "user",
            "content": "Explain how to best learn Rust.",
        },
    ],
    max_tokens=64,
)
print(completion.choices[0].message.content)
```
Next, launch a `candle-vllm` instance by running `HF_TOKEN=... cargo run --release -- --hf-token HF_TOKEN --port 2000 llama7b --repeat-last-n 64`.
After the `candle-vllm` instance is running, run the Python script and enjoy efficient inference with an OpenAI-compatible API server!
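
Because streaming is supported during generation, the same chat endpoint can also be consumed token by token. The sketch below is a minimal, hedged example using the `openai` client's standard `stream=True` option; it assumes the server emits chat-completion chunks in the usual OpenAI format and reuses the model name and port from the quickstart above.

```python
import openai

openai.api_key = "EMPTY"
openai.base_url = "http://localhost:2000/v1/"

# Request a streamed chat completion; chunks arrive as they are generated.
stream = openai.chat.completions.create(
    model="llama7b",
    messages=[{"role": "user", "content": "Explain how to best learn Rust."}],
    max_tokens=64,
    stream=True,
)

# Print each delta as soon as it arrives instead of waiting for the full reply.
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta is not None:
        print(delta, end="", flush=True)
print()
```

Printing with `end=""` and `flush=True` lets the partial response appear in the terminal as it is generated.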
The following features are planned but not yet implemented; contributions toward them are especially welcome:
- Sampling methods:
  - Beam search (huggingface/candle#1319)
- Pipeline batching (#3)
- PagedAttention (#3)
- More pipelines (from `candle-transformers`)
- Python implementation: [`vllm-project/vllm`](https://github.com/vllm-project/vllm)
- [vLLM paper](https://arxiv.org/abs/2309.06180)