llama.cpp/server/README.md (51 lines changed: 51 additions & 0 deletions)
@@ -122,6 +122,8 @@ node index.js

`top_p`: Limit the next token selection to a subset of tokens with a cumulative probability above a threshold P (default: 0.95).

`min_p`: The minimum probability for a token to be considered, relative to the probability of the most likely token (default: 0.05). See the sketch after this parameter list for how it differs from `top_p`.

`n_predict`: Set the maximum number of tokens to predict when generating text. **Note:** May exceed the set limit slightly if the last token is a partial multibyte character. When 0, no tokens will be generated but the prompt is evaluated into the cache. (default: -1, where -1 = infinity).

`n_keep`: Specify the number of tokens from the prompt to retain when the context size is exceeded and tokens need to be discarded.
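
As referenced above, here is a minimal Python sketch contrasting the `top_p` and `min_p` filters on a hypothetical toy distribution (the actual sampler in llama.cpp is implemented in C++ and operates on logits; this only illustrates the idea):

```python
def filter_top_p(probs: dict, top_p: float = 0.95) -> dict:
    # Keep the smallest set of tokens whose cumulative probability >= top_p.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = {}, 0.0
    for token, p in ranked:
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    return kept

def filter_min_p(probs: dict, min_p: float = 0.05) -> dict:
    # Drop tokens whose probability is below min_p * (probability of the top token).
    threshold = min_p * max(probs.values())
    return {token: p for token, p in probs.items() if p >= threshold}

# Toy next-token distribution (made-up numbers).
probs = {"the": 0.60, "a": 0.30, "an": 0.09, "zzz": 0.01}
print(filter_top_p(probs))  # keeps 'the', 'a', 'an' (cumulative 0.99 >= 0.95)
print(filter_min_p(probs))  # drops 'zzz' (0.01 < 0.05 * 0.60 = 0.03)
```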
@@ -232,6 +234,55 @@ node index.js

- **GET** `/props`: Return the required assistant name and anti-prompt to generate the prompt in case you have specified a system prompt for all slots.

- **POST** `/v1/chat/completions`: OpenAI-compatible Chat Completions API. Given a ChatML-formatted JSON description in `messages`, it returns the predicted completion. Both synchronous and streaming modes are supported, so scripted and interactive applications work fine. While no strong claims of compatibility with the OpenAI API spec are made, in our experience it suffices to support many apps. Only ChatML-tuned models, such as Dolphin, OpenOrca, OpenHermes, OpenChat-3.5, etc., can be used with this endpoint. Compared to `api_like_OAI.py`, this API implementation does not require a wrapper to be served.

*Options:*

See [OpenAI Chat Completions API documentation](https://platform.openai.com/docs/api-reference/chat). While some OpenAI-specific features such as function calling aren't supported, llama.cpp `/completion`-specific features such as `mirostat` are supported; a hypothetical sketch of passing one such field follows.
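
For instance, the Python client's `extra_body` parameter merges additional fields into the request JSON, which is one way extra sampler options could be sent. This is a hypothetical sketch: the field names (`mirostat`, `mirostat_tau`, `mirostat_eta`) follow the `/completion` documentation, and whether each one is honored by this endpoint depends on the server.

```python
import openai

client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="no-key")

# extra_body merges llama.cpp-specific sampling fields into the JSON body
# alongside the standard OpenAI fields.
completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Say hello."}],
    extra_body={"mirostat": 2, "mirostat_tau": 5.0, "mirostat_eta": 0.1},
)
print(completion.choices[0].message.content)
```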

*Examples:*

You can use either the Python `openai` library with appropriate checkpoints:

```python
import openai

# Client pointed at the local llama.cpp server; any placeholder key works
# for a local server that does not check API keys.
client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="no-key"
)

completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."},
        {"role": "user", "content": "Write a limerick about python exceptions"}
    ]
)

print(completion.choices[0].message)
```

...or raw HTTP requests:

```shell
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer no-key" \
    -d '{
        "model": "gpt-3.5-turbo",
        "messages": [
            {
                "role": "system",
                "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
            },
            {
                "role": "user",
                "content": "Write a limerick about python exceptions"
            }
        ]
    }'
```
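
As noted above, streaming is supported through the same endpoint. A minimal sketch with the Python client, assuming the same local server as in the examples: passing `stream=True` yields chunks as the server generates tokens.

```python
import openai

client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="no-key")

stream = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Write a limerick about python exceptions"}],
    stream=True,  # tokens arrive incrementally instead of one final response
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta is not None:
        print(delta, end="", flush=True)
print()
```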