
Commit 401dd08

ggerganov authored and jart committed
Add OpenAI API compatibility to server
This is a cherry-pick of ggml-org/llama.cpp@af19d35
1 parent ed87fdb commit 401dd08

File tree

2 files changed: +407 -8 lines changed


llama.cpp/server/README.md

Lines changed: 51 additions & 0 deletions
@@ -122,6 +122,8 @@ node index.js
`top_p`: Limit the next token selection to a subset of tokens with a cumulative probability above a threshold P (default: 0.95).

`min_p`: The minimum probability for a token to be considered, relative to the probability of the most likely token (default: 0.05).

`n_predict`: Set the maximum number of tokens to predict when generating text. **Note:** May exceed the set limit slightly if the last token is a partial multibyte character. When 0, no tokens will be generated but the prompt is evaluated into the cache (default: -1, where -1 = infinity).

`n_keep`: Specify the number of tokens from the prompt to retain when the context size is exceeded and tokens need to be discarded.
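
To make these sampling knobs concrete, here is a minimal sketch (not part of this commit) that POSTs them to the server's `/completion` endpoint; the localhost address, the example values, and the shape of the JSON response are assumptions:

```python
import json
import urllib.request

# Hypothetical request exercising the sampling options documented above.
payload = {
    "prompt": "Building a website can be done in 10 simple steps:",
    "top_p": 0.95,    # nucleus sampling threshold (default: 0.95)
    "min_p": 0.05,    # minimum relative token probability (default: 0.05)
    "n_predict": 64,  # stop after 64 tokens; -1 would mean no limit
    "n_keep": 32,     # retain the first 32 prompt tokens on context overflow
}

req = urllib.request.Request(
    "http://localhost:8080/completion",            # assumed server address
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    # assumes the response JSON carries the generated text in "content"
    print(json.loads(resp.read())["content"])
```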
@@ -232,6 +234,55 @@ node index.js
- **GET** `/props`: Return the required assistant name and anti-prompt to generate the prompt in case you have specified a system prompt for all slots.

- **POST** `/v1/chat/completions`: OpenAI-compatible Chat Completions API. Given a ChatML-formatted JSON description in `messages`, it returns the predicted completion. Both synchronous and streaming modes are supported, so scripted and interactive applications work fine. While no strong claims of compatibility with the OpenAI API spec are made, in our experience it suffices to support many apps. Only ChatML-tuned models, such as Dolphin, OpenOrca, OpenHermes, or OpenChat-3.5, can be used with this endpoint. Compared to `api_like_OAI.py`, this implementation does not require a separate wrapper to be served.

  *Options:*

  See [OpenAI Chat Completions API documentation](https://platform.openai.com/docs/api-reference/chat). While some OpenAI-specific features such as function calling aren't supported, llama.cpp `/completion`-specific features such as `mirostat` are supported.

  *Examples:*

  You can use either the Python `openai` library with appropriate checkpoints:
  ```python
  import openai

  client = openai.OpenAI(
      base_url="http://localhost:8080/v1",  # "http://<Your api-server IP>:port"
      api_key="sk-no-key-required"
  )

  completion = client.chat.completions.create(
      model="gpt-3.5-turbo",
      messages=[
          {"role": "system", "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."},
          {"role": "user", "content": "Write a limerick about python exceptions"}
      ]
  )

  print(completion.choices[0].message)
  ```
... or raw HTTP requests:
  ```shell
  curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer no-key" \
      -d '{
          "model": "gpt-3.5-turbo",
          "messages": [
              {
                  "role": "system",
                  "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
              },
              {
                  "role": "user",
                  "content": "Write a limerick about python exceptions"
              }
          ]
      }'
  ```
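
  Since the endpoint also supports streaming and llama.cpp `/completion`-specific sampler options (see *Options:* above), here is a minimal sketch (not part of this commit) of both with the same `openai` client; whether `mirostat` is accepted as a top-level body field is an assumption here:

  ```python
  import openai

  client = openai.OpenAI(
      base_url="http://localhost:8080/v1",  # same assumed local server as above
      api_key="sk-no-key-required",
  )

  # stream=True requests incremental chunks; extra_body forwards fields the
  # OpenAI client doesn't know about (here: llama.cpp's mirostat sampler).
  stream = client.chat.completions.create(
      model="gpt-3.5-turbo",
      messages=[{"role": "user", "content": "Write a limerick about python exceptions"}],
      stream=True,
      extra_body={"mirostat": 2},  # assumed to pass through as a top-level field
  )

  for chunk in stream:
      if chunk.choices and chunk.choices[0].delta.content:
          print(chunk.choices[0].delta.content, end="", flush=True)
  print()
  ```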
## More examples

### Change system prompt on runtime
