docs/build.md
### MUSA
This provides GPU acceleration using the MUSA cores of your Moore Threads MTT GPU. Make sure to have the MUSA SDK installed. You can download it from here: [MUSA SDK](https://developer.mthreads.com/sdk/download/musa).
- Using `make`:

  ```bash
  make GGML_MUSA=1
  ```

- Using `CMake`:

  ```bash
  cmake -B build -DGGML_MUSA=ON
  cmake --build build --config Release
  ```
The environment variable [`MUSA_VISIBLE_DEVICES`](https://docs.mthreads.com/musa-sdk/musa-sdk-doc-online/programming_guide/Z%E9%99%84%E5%BD%95/) can be used to specify which GPU(s) will be used.
The environment variable `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1` can be used to enable unified memory in Linux. This allows swapping to system RAM instead of crashing when the GPU VRAM is exhausted.
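Putting the two variables together, a launch on a Linux box might look like the following sketch (the device index, binary name, and model path are illustrative assumptions, not from this document):

```shell
# Pin work to the first MTT GPU (device index 0 is an assumption)
export MUSA_VISIBLE_DEVICES=0
# Allow spilling into system RAM instead of crashing when VRAM is exhausted (Linux)
export GGML_CUDA_ENABLE_UNIFIED_MEMORY=1
# ./llama-cli -m model.gguf -p "Hello"   # uncomment on a machine with a MUSA build
echo "devices=$MUSA_VISIBLE_DEVICES unified=$GGML_CUDA_ENABLE_UNIFIED_MEMORY"
```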
Most of the compilation options available for CUDA should also be available for MUSA, though they haven't been thoroughly tested yet.
### hipBLAS
This provides BLAS acceleration on HIP-supported AMD GPUs.
---

| Argument | Explanation |
| -------- | ----------- |
| `--props` | enable changing global properties via POST /props (default: disabled)<br/>(env: LLAMA_ARG_ENDPOINT_PROPS) |
- The prompt is a string or an array with the first element given as a string
- The model's `tokenizer.ggml.add_bos_token` metadata is `true`
`temperature`: Adjust the randomness of the generated text. Default: `0.8`
`min_keep`: If greater than 0, force samplers to return N possible tokens at minimum. Default: `0`
`t_max_predict_ms`: Set a time limit in milliseconds for the prediction (a.k.a. text-generation) phase. The timeout will trigger if the generation takes more than the specified time (measured since the first token was generated) and if a new-line character has already been generated. Useful for FIM applications. Default: `0`, which is disabled.
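For illustration, a FIM-style client might assemble its request body like this sketch (the field values are arbitrary examples, not defaults from this document):

```python
import json

# Sketch of a /completion request body: stop generating roughly 500 ms
# after the first token, once a new-line has already been produced.
body = {
    "prompt": "def add(a, b):\n",
    "n_predict": 64,          # any other /completion option may appear too
    "t_max_predict_ms": 500,  # 0 (the default) disables the time limit
}
payload = json.dumps(body)
print(payload)
```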
`image_data`: An array of objects holding base64-encoded image `data` and its `id` to be referenced in `prompt`. You can determine the place of the image in the prompt as in the following: `USER:[img-12]Describe the image in detail.\nASSISTANT:`. In this case, `[img-12]` will be replaced by the embeddings of the image with id `12` in the following `image_data` array: `{..., "image_data": [{"data": "<BASE64_STRING>", "id": 12}]}`. Use `image_data` only with multimodal models, e.g., LLaVA.
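A hypothetical multimodal request could be built as follows (the image bytes are a placeholder; in practice you would read a real image file):

```python
import base64
import json

# [img-12] in the prompt is replaced by the embeddings of the image
# whose "id" is 12 in the image_data array.
image_bytes = b"\x89PNG\r\n\x1a\n"  # placeholder bytes, not a full image
body = {
    "prompt": "USER:[img-12]Describe the image in detail.\nASSISTANT:",
    "image_data": [
        {"data": base64.b64encode(image_bytes).decode("ascii"), "id": 12},
    ],
}
payload = json.dumps(body)
```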
`id_slot`: Assign the completion task to a specific slot. If it is `-1`, the task will be assigned to an idle slot. Default: `-1`
Takes a prefix and a suffix and returns the predicted completion as a stream.
- `input_prefix`: Set the prefix of the code to infill.
- `input_suffix`: Set the suffix of the code to infill.
It also accepts all the options of `/completion`.
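An infill request body might be assembled like this sketch (the code fragments and `n_predict` value are illustrative):

```python
import json

# Sketch of an /infill request body: the server predicts the code
# between input_prefix and input_suffix.
body = {
    "input_prefix": "def remove_non_ascii(s: str) -> str:\n    ",
    "input_suffix": "\n    return result\n",
    "n_predict": 64,  # any /completion option may be added as well
}
payload = json.dumps(body)
```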
### **GET** `/props`: Get server global properties.
This endpoint is public (no API key check). By default, it is read-only. To make a POST request to change global properties, you need to start the server with `--props`.
```json
{
  "default_generation_settings": { ... },
  "total_slots": 1,
  "chat_template": ""
}
```
- `default_generation_settings` - the default generation settings for the `/completion` endpoint, which has the same fields as the `generation_settings` response object from the `/completion` endpoint.
- `total_slots` - the total number of slots for processing requests (defined by the `--parallel` option)
- `chat_template` - the model's original Jinja2 prompt template
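A client consuming this endpoint simply parses the JSON shown above; a minimal sketch with a hard-coded stand-in for the response body (not a live server call):

```python
import json

# Stand-in for a GET /props response body, mirroring the sample above
raw = '{"default_generation_settings": {}, "total_slots": 1, "chat_template": ""}'
props = json.loads(raw)
print(props["total_slots"])  # number of parallel slots, set via --parallel
```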
To use this endpoint with the POST method, you need to start the server with `--props`.
*Options:*
- None yet
### POST `/v1/chat/completions`: OpenAI-compatible Chat Completions API