
Commit b4c9911

Merge branch 'master' into xsn/llama_batch_remove_compat
2 parents: 0639ff1 + edc2656

19 files changed (+773, -725 lines)

README.md

Lines changed: 2 additions & 2 deletions
@@ -31,7 +31,7 @@ variety of hardware - locally and in the cloud.
 - Apple silicon is a first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks
 - AVX, AVX2 and AVX512 support for x86 architectures
 - 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use
-- Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP)
+- Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP and Moore Threads MTT GPUs via MUSA)
 - Vulkan and SYCL backend support
 - CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity


@@ -413,7 +413,7 @@ Please refer to [Build llama.cpp locally](./docs/build.md)
 | [BLAS](./docs/build.md#blas-build) | All |
 | [BLIS](./docs/backend/BLIS.md) | All |
 | [SYCL](./docs/backend/SYCL.md) | Intel and Nvidia GPU |
-| [MUSA](./docs/build.md#musa) | Moore Threads GPU |
+| [MUSA](./docs/build.md#musa) | Moore Threads MTT GPU |
 | [CUDA](./docs/build.md#cuda) | Nvidia GPU |
 | [hipBLAS](./docs/build.md#hipblas) | AMD GPU |
 | [Vulkan](./docs/build.md#vulkan) | GPU |

common/arg.cpp

Lines changed: 112 additions & 155 deletions
Large diffs are not rendered by default.

common/common.cpp

Lines changed: 17 additions & 1 deletion
@@ -12,6 +12,7 @@

 #include <algorithm>
 #include <cinttypes>
+#include <climits>
 #include <cmath>
 #include <codecvt>
 #include <cstdarg>
@@ -23,10 +24,10 @@
 #include <regex>
 #include <sstream>
 #include <string>
+#include <thread>
 #include <unordered_map>
 #include <unordered_set>
 #include <vector>
-#include <thread>

 #if defined(__APPLE__) && defined(__MACH__)
 #include <sys/types.h>
@@ -400,6 +401,21 @@ std::string common_params_get_system_info(const common_params & params) {
 // String utils
 //

+std::string string_format(const char * fmt, ...) {
+    va_list ap;
+    va_list ap2;
+    va_start(ap, fmt);
+    va_copy(ap2, ap);
+    int size = vsnprintf(NULL, 0, fmt, ap);
+    GGML_ASSERT(size >= 0 && size < INT_MAX); // NOLINT
+    std::vector<char> buf(size + 1);
+    int size2 = vsnprintf(buf.data(), size + 1, fmt, ap2);
+    GGML_ASSERT(size2 == size);
+    va_end(ap2);
+    va_end(ap);
+    return std::string(buf.data(), size);
+}
+
 std::vector<std::string> string_split(std::string input, char separator) {
     std::vector<std::string> parts;
     size_t separator_pos = input.find(separator);
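The new `string_format` helper follows the classic two-pass `vsnprintf` pattern: the first call with a null buffer only measures the required length, and the second call writes into a buffer sized from that result; the `va_list` has to be copied up front because the first pass consumes it. Below is a minimal standalone sketch of the same pattern, not the llama.cpp code itself: `GGML_ASSERT` is swapped for plain `assert`, and the function and variable names are illustrative.

```cpp
// Minimal sketch of the two-pass vsnprintf pattern used by string_format.
// Standalone illustration: assert replaces GGML_ASSERT, names are illustrative.
#include <cassert>
#include <climits>
#include <cstdarg>
#include <cstdio>
#include <string>
#include <vector>

static std::string format_string(const char * fmt, ...) {
    va_list ap;
    va_list ap2;
    va_start(ap, fmt);
    va_copy(ap2, ap);                                            // pass 1 consumes ap
    const int size = std::vsnprintf(nullptr, 0, fmt, ap);        // pass 1: measure only
    assert(size >= 0 && size < INT_MAX);
    std::vector<char> buf(size + 1);                             // +1 for the trailing '\0'
    const int size2 = std::vsnprintf(buf.data(), size + 1, fmt, ap2);  // pass 2: write
    assert(size2 == size);
    va_end(ap2);
    va_end(ap);
    return std::string(buf.data(), size);
}

int main() {
    const std::string msg = format_string("slot %d: %s (%.2f ms)", 3, "done", 12.5);
    std::printf("%s\n", msg.c_str());  // prints: slot 3: done (12.50 ms)
    return 0;
}
```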

common/common.h

Lines changed: 16 additions & 4 deletions
@@ -282,7 +282,6 @@ struct common_params {
     std::string hostname = "127.0.0.1";
     std::string public_path = ""; // NOLINT
     std::string chat_template = ""; // NOLINT
-    std::string system_prompt = ""; // NOLINT
     bool enable_chat_template = true;

     std::vector<std::string> api_keys;
@@ -352,15 +351,28 @@ void common_init();

 std::string common_params_get_system_info(const common_params & params);

-bool parse_cpu_range(const std::string& range, bool(&boolmask)[GGML_MAX_N_THREADS]);
-bool parse_cpu_mask(const std::string& mask, bool(&boolmask)[GGML_MAX_N_THREADS]);
-void postprocess_cpu_params(cpu_params& cpuparams, const cpu_params* role_model = nullptr);
+bool parse_cpu_range(const std::string & range, bool(&boolmask)[GGML_MAX_N_THREADS]);
+bool parse_cpu_mask(const std::string & mask, bool(&boolmask)[GGML_MAX_N_THREADS]);
+void postprocess_cpu_params(cpu_params & cpuparams, const cpu_params * role_model = nullptr);
 bool set_process_priority(enum ggml_sched_priority prio);

 //
 // String utils
 //

+#ifdef __GNUC__
+#ifdef __MINGW32__
+#define LLAMA_COMMON_ATTRIBUTE_FORMAT(...) __attribute__((format(gnu_printf, __VA_ARGS__)))
+#else
+#define LLAMA_COMMON_ATTRIBUTE_FORMAT(...) __attribute__((format(printf, __VA_ARGS__)))
+#endif
+#else
+#define LLAMA_COMMON_ATTRIBUTE_FORMAT(...)
+#endif
+
+LLAMA_COMMON_ATTRIBUTE_FORMAT(1, 2)
+std::string string_format(const char * fmt, ...);
+
 std::vector<std::string> string_split(std::string input, char separator);

 std::string string_strip(const std::string & str);
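The header pairs the declaration with `LLAMA_COMMON_ATTRIBUTE_FORMAT(1, 2)`, which on GCC/Clang expands to a printf-style `format` attribute (using `gnu_printf` on MinGW, where the default printf dialect differs), so the compiler can check the arguments against the format string at compile time. A rough illustration of the kind of mismatch this catches, using a hypothetical macro and function name rather than the llama.cpp ones:

```cpp
// Illustration of what the printf-format attribute buys: with it on the
// declaration, GCC/Clang's -Wformat checks arguments against the format string.
// ATTRIBUTE_FORMAT and my_format are hypothetical names, not the llama.cpp ones.
#include <string>

#ifdef __GNUC__
#define ATTRIBUTE_FORMAT(...) __attribute__((format(printf, __VA_ARGS__)))
#else
#define ATTRIBUTE_FORMAT(...)
#endif

ATTRIBUTE_FORMAT(1, 2)
static std::string my_format(const char * fmt, ...) {
    return fmt;  // body is irrelevant here; only the attribute on the signature matters
}

int main() {
    my_format("%d tokens", 42);       // fine: %d matches an int argument
    // my_format("%d tokens", "x");   // would warn: format '%d' expects int, got 'const char *'
    return 0;
}
```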

docs/build.md

Lines changed: 8 additions & 0 deletions
@@ -198,6 +198,8 @@ The following compilation options are also available to tweak performance:

 ### MUSA

+This provides GPU acceleration using the MUSA cores of your Moore Threads MTT GPU. Make sure to have the MUSA SDK installed. You can download it from here: [MUSA SDK](https://developer.mthreads.com/sdk/download/musa).
+
 - Using `make`:
   ```bash
   make GGML_MUSA=1
@@ -209,6 +211,12 @@ The following compilation options are also available to tweak performance:
   cmake --build build --config Release
   ```

+The environment variable [`MUSA_VISIBLE_DEVICES`](https://docs.mthreads.com/musa-sdk/musa-sdk-doc-online/programming_guide/Z%E9%99%84%E5%BD%95/) can be used to specify which GPU(s) will be used.
+
+The environment variable `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1` can be used to enable unified memory in Linux. This allows swapping to system RAM instead of crashing when the GPU VRAM is exhausted.
+
+Most of the compilation options available for CUDA should also be available for MUSA, though they haven't been thoroughly tested yet.
+
 ### hipBLAS

 This provides BLAS acceleration on HIP-supported AMD GPUs.

examples/infill/infill.cpp

Lines changed: 7 additions & 7 deletions
@@ -205,11 +205,11 @@ int main(int argc, char ** argv) {
     std::vector<llama_token> inp_pfx = common_tokenize(ctx, params.input_prefix, false);
     std::vector<llama_token> inp_sfx = common_tokenize(ctx, params.input_suffix, false);

-    GGML_ASSERT(llama_token_prefix(model) >= 0);
-    GGML_ASSERT(llama_token_suffix(model) >= 0);
+    GGML_ASSERT(llama_token_fim_pre(model) >= 0);
+    GGML_ASSERT(llama_token_fim_suf(model) >= 0);

-    inp_pfx.insert(inp_pfx.begin(), llama_token_prefix(model));
-    inp_sfx.insert(inp_sfx.begin(), llama_token_suffix(model));
+    inp_pfx.insert(inp_pfx.begin(), llama_token_fim_pre(model));
+    inp_sfx.insert(inp_sfx.begin(), llama_token_fim_suf(model));

     embd_inp = params.spm_infill ? inp_sfx : inp_pfx;
     embd_end = params.spm_infill ? inp_pfx : inp_sfx;
@@ -218,7 +218,7 @@
     }
     embd_inp.insert(embd_inp.end(), embd_end.begin(), embd_end.end());

-    const llama_token middle_token = llama_token_middle(model);
+    const llama_token middle_token = llama_token_fim_mid(model);
     if (middle_token >= 0) {
         embd_inp.push_back(middle_token);
     }
@@ -508,8 +508,8 @@ int main(int argc, char ** argv) {
     std::vector<llama_token> inp_pfx = common_tokenize(ctx, params.input_prefix, false);
     std::vector<llama_token> inp_sfx = common_tokenize(ctx, params.input_suffix, false);

-    inp_pfx.insert(inp_pfx.begin(), llama_token_prefix(model));
-    inp_sfx.insert(inp_sfx.begin(), llama_token_suffix(model));
+    inp_pfx.insert(inp_pfx.begin(), llama_token_fim_pre(model));
+    inp_sfx.insert(inp_sfx.begin(), llama_token_fim_suf(model));

     embd_inp = params.spm_infill ? inp_sfx : inp_pfx;
     embd_end = params.spm_infill ? inp_pfx : inp_sfx;
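These hunks are the mechanical rename from the old `llama_token_prefix`/`_suffix`/`_middle` accessors to the FIM-specific `llama_token_fim_pre`/`_suf`/`_mid` ones. For orientation, the sketch below condenses the diff above into one sequence showing how the infill prompt is assembled; the declarations of `embd_inp`/`embd_end` are added here for readability, and `ctx`, `model`, and `params` are assumed to be initialized as in the surrounding example.

```cpp
// Condensed from the diff above: assemble the fill-in-the-middle prompt as
// <FIM_PRE> prefix <FIM_SUF> suffix <FIM_MID> (or suffix-first for SPM models).
std::vector<llama_token> inp_pfx = common_tokenize(ctx, params.input_prefix, false);
std::vector<llama_token> inp_sfx = common_tokenize(ctx, params.input_suffix, false);

GGML_ASSERT(llama_token_fim_pre(model) >= 0);  // the model must define FIM tokens
GGML_ASSERT(llama_token_fim_suf(model) >= 0);

inp_pfx.insert(inp_pfx.begin(), llama_token_fim_pre(model));
inp_sfx.insert(inp_sfx.begin(), llama_token_fim_suf(model));

// SPM-style infill models expect the suffix block before the prefix block
std::vector<llama_token> embd_inp = params.spm_infill ? inp_sfx : inp_pfx;
std::vector<llama_token> embd_end = params.spm_infill ? inp_pfx : inp_sfx;
embd_inp.insert(embd_inp.end(), embd_end.begin(), embd_end.end());

// the middle token is optional; generation continues after it when present
const llama_token middle_token = llama_token_fim_mid(model);
if (middle_token >= 0) {
    embd_inp.push_back(middle_token);
}
```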

examples/server/README.md

Lines changed: 4 additions & 8 deletions
@@ -60,8 +60,6 @@ The project is under active development, and we are [looking for feedback and co
 | `--yarn-attn-factor N` | YaRN: scale sqrt(t) or attention magnitude (default: 1.0)<br/>(env: LLAMA_ARG_YARN_ATTN_FACTOR) |
 | `--yarn-beta-slow N` | YaRN: high correction dim or alpha (default: 1.0)<br/>(env: LLAMA_ARG_YARN_BETA_SLOW) |
 | `--yarn-beta-fast N` | YaRN: low correction dim or beta (default: 32.0)<br/>(env: LLAMA_ARG_YARN_BETA_FAST) |
-| `-gan, --grp-attn-n N` | group-attention factor (default: 1)<br/>(env: LLAMA_ARG_GRP_ATTN_N) |
-| `-gaw, --grp-attn-w N` | group-attention width (default: 512.0)<br/>(env: LLAMA_ARG_GRP_ATTN_W) |
 | `-dkvc, --dump-kv-cache` | verbose print of the KV cache |
 | `-nkvo, --no-kv-offload` | disable KV offload<br/>(env: LLAMA_ARG_NO_KV_OFFLOAD) |
 | `-ctk, --cache-type-k TYPE` | KV cache data type for K (default: f16)<br/>(env: LLAMA_ARG_CACHE_TYPE_K) |
@@ -149,7 +147,6 @@ The project is under active development, and we are [looking for feedback and co
 | `--ssl-cert-file FNAME` | path to file a PEM-encoded SSL certificate<br/>(env: LLAMA_ARG_SSL_CERT_FILE) |
 | `-to, --timeout N` | server read/write timeout in seconds (default: 600)<br/>(env: LLAMA_ARG_TIMEOUT) |
 | `--threads-http N` | number of threads used to process HTTP requests (default: -1)<br/>(env: LLAMA_ARG_THREADS_HTTP) |
-| `-spf, --system-prompt-file FNAME` | set a file to load a system prompt (initial prompt of all slots), this is useful for chat applications |
 | `--metrics` | enable prometheus compatible metrics endpoint (default: disabled)<br/>(env: LLAMA_ARG_ENDPOINT_METRICS) |
 | `--slots` | enable slots monitoring endpoint (default: disabled)<br/>(env: LLAMA_ARG_ENDPOINT_SLOTS) |
 | `--props` | enable changing global properties via POST /props (default: disabled)<br/>(env: LLAMA_ARG_ENDPOINT_PROPS) |
@@ -320,7 +317,6 @@ node index.js

 - The prompt is a string or an array with the first element given as a string
 - The model's `tokenizer.ggml.add_bos_token` metadata is `true`
-- The system prompt is empty

 `temperature`: Adjust the randomness of the generated text. Default: `0.8`

@@ -378,6 +374,8 @@ node index.js

 `min_keep`: If greater than 0, force samplers to return N possible tokens at minimum. Default: `0`

+`t_max_predict_ms`: Set a time limit in milliseconds for the prediction (a.k.a. text-generation) phase. The timeout will trigger if the generation takes more than the specified time (measured since the first token was generated) and if a new-line character has already been generated. Useful for FIM applications. Default: `0`, which is disabled.
+
 `image_data`: An array of objects to hold base64-encoded image `data` and its `id`s to be reference in `prompt`. You can determine the place of the image in the prompt as in the following: `USER:[img-12]Describe the image in detail.\nASSISTANT:`. In this case, `[img-12]` will be replaced by the embeddings of the image with id `12` in the following `image_data` array: `{..., "image_data": [{"data": "<BASE64_STRING>", "id": 12}]}`. Use `image_data` only with multimodal models, e.g., LLaVA.

 `id_slot`: Assign the completion task to an specific slot. If is -1 the task will be assigned to a Idle slot. Default: `-1`
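The `t_max_predict_ms` option added above bounds only the text-generation phase: the clock starts when the first token is produced, and the limit is only enforced once a new-line character has been generated, so a FIM completion is cut at a line boundary rather than mid-line. A rough sketch of that stopping rule, with hypothetical types and names rather than the server's actual code:

```cpp
// Rough sketch of the t_max_predict_ms stopping rule described above:
// stop once the budget is exceeded AND a new-line has already been generated.
// Types and names here are hypothetical, not the server implementation.
#include <chrono>

struct predict_state {
    std::chrono::steady_clock::time_point t_first_token;  // set when the first token arrives
    bool   has_first_token  = false;
    bool   has_new_line     = false;  // set once a '\n' has been generated
    double t_max_predict_ms = 0.0;    // 0 disables the limit (the default)
};

static bool should_stop_predicting(const predict_state & st) {
    if (st.t_max_predict_ms <= 0.0 || !st.has_first_token || !st.has_new_line) {
        return false;  // limit disabled, or its preconditions not met yet
    }
    const double elapsed_ms = std::chrono::duration<double, std::milli>(
        std::chrono::steady_clock::now() - st.t_first_token).count();
    return elapsed_ms > st.t_max_predict_ms;
}
```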
@@ -526,7 +524,7 @@ Takes a prefix and a suffix and returns the predicted completion as stream.
 - `input_prefix`: Set the prefix of the code to infill.
 - `input_suffix`: Set the suffix of the code to infill.

-It also accepts all the options of `/completion` except `stream` and `prompt`.
+It also accepts all the options of `/completion`.

 ### **GET** `/props`: Get server global properties.

@@ -536,14 +534,12 @@ This endpoint is public (no API key check). By default, it is read-only. To make

 ```json
 {
-    "system_prompt": "",
     "default_generation_settings": { ... },
     "total_slots": 1,
     "chat_template": ""
 }
 ```

-- `system_prompt` - the system prompt (initial prompt of all slots). Please note that this does not take into account the chat template. It will append the prompt at the beginning of formatted prompt.
 - `default_generation_settings` - the default generation settings for the `/completion` endpoint, which has the same fields as the `generation_settings` response object from the `/completion` endpoint.
 - `total_slots` - the total number of slots for process requests (defined by `--parallel` option)
 - `chat_template` - the model's original Jinja2 prompt template
@@ -554,7 +550,7 @@ To use this endpoint with POST method, you need to start server with `--props`

 *Options:*

-- `system_prompt`: Change the system prompt (initial prompt of all slots). Please note that this does not take into account the chat template. It will append the prompt at the beginning of formatted prompt.
+- None yet

 ### POST `/v1/chat/completions`: OpenAI-compatible Chat Completions API
