[pull] master from ggerganov:master #142

Closed
wants to merge 34 commits into from

Conversation


@pull pull bot commented Aug 14, 2024

See Commits and Changes for more details.


Created by pull[bot]

Can you help keep this open source service alive? 💖 Please sponsor : )

ggerganov and others added 2 commits August 14, 2024 09:14
* server : fix segfault on long system prompt

* server : fix parallel generation with very small batch sizes

* server : fix typo in comment
* Optimize Vulkan REPEAT performance

* Use Vulkan GLSL fused multiply-add instruction where possible

* Add GGML_VULKAN_PERF option to output performance data per operator

* Rework and fix Vulkan descriptor set and descriptor pool handling

* Fix float32 concat f16 shader validation error

* Add Vulkan GROUP_NORM eps parameter

* Fix validation error with transfer queue memory barrier flags

* Remove trailing whitespaces
kylo5aby and others added 13 commits August 15, 2024 10:23
* retrieval

* Reuse querybatch to reduce frequent memory allocation

* delete unused whitespace
* ggml : Dynamic ggml_sched_max_splits based on graph_size

* Fixed and readded debug code for causes
* Add nemotron GGUF conversion & inference support

* Fix formatting issues

* Remove unnecessary write_tensors()

* Update convert_hf_to_gguf.py

Co-authored-by: compilade <git@compilade.net>

* Update src/llama.cpp

Co-authored-by: compilade <git@compilade.net>

* Address comments by @compilade

* Replace ggml_mul_mat()->llm_build_lora_mm()

* Remove mutable variable

* Use  for bias tensors

* Cover corner case for rope_scaling not in config.json

---------

Co-authored-by: compilade <git@compilade.net>
* Add support for cpu_get_num_physical_cores() on Windows

* fix build bug on msys2-clang64 and ucrt64

* avoid adding a new function

* add new macros to avoid windows+mingw64

* Add error checking to return default value
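
A hedged sketch of the usual Win32 approach behind a change like this (illustrative only, not the exact llama.cpp code; the helper name and the fallback value are assumptions): count RelationProcessorCore entries reported by GetLogicalProcessorInformationEx, and return a default on error.

```cpp
// Illustrative sketch only: count physical cores on Windows by enumerating
// RelationProcessorCore entries. Falls back to a caller-provided default on error.
#include <windows.h>
#include <vector>

static int count_physical_cores_win(int fallback) {
    DWORD len = 0;
    GetLogicalProcessorInformationEx(RelationProcessorCore, nullptr, &len);
    if (len == 0) {
        return fallback;
    }
    std::vector<char> buf(len);
    auto * info = reinterpret_cast<PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX>(buf.data());
    if (!GetLogicalProcessorInformationEx(RelationProcessorCore, info, &len)) {
        return fallback;
    }
    int cores = 0;
    for (char * p = buf.data(); p < buf.data() + len; ) {
        auto * rec = reinterpret_cast<PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX>(p);
        if (rec->Relationship == RelationProcessorCore) {
            cores++;
        }
        p += rec->Size; // each record carries its own size
    }
    return cores > 0 ? cores : fallback;
}
```
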
* add exaone model support

* add chat template

* fix whitespace

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* add ftype

* add exaone pre-tokenizer in `llama-vocab.cpp`

Co-Authored-By: compilade <113953597+compilade@users.noreply.github.com>

* fix lint

Co-Authored-By: compilade <113953597+compilade@users.noreply.github.com>

* add `EXAONE` to supported models in `README.md`

* fix space

Co-authored-by: compilade <git@compilade.net>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: compilade <113953597+compilade@users.noreply.github.com>
Co-authored-by: compilade <git@compilade.net>
Signed-off-by: Aisuko <urakiny@gmail.com>
Co-authored-by: farbod <farbod.bjary82@gmail.com>
* init

* rename

* add run android for termux in readme

* add android readme

* add instructions in readme

* change name in readme

* Update README.md

* fixed line

* add result in readme

* random pos_embed

* add positions index

* change for ollama

* change for ollama

* better pos_embed in clip

* support ollama

* update CMakeLists

* update CMakeLists

* rename wrapper

* clear code

* replace and organize code

* add link

* sync master

* fix warnings

* fix warnings

* fix bug in bicubic resize when the image needs to be resized smaller

* address review comments and modify

* address review comments and modify

* put all code into llava dir

* fix quality problem in pr code

* change n_layer

* add space in "-1"

* imitate reshape bug of python code

* fix bug in clip

* fix issues for merging

* fix llama-minicpmv-cli in cmake file

* change pr readme

* fix code review

* remove the line-33 directory entry in the top-level CMakeLists.txt (not the one in the example, in the main dir)

* fix cmakefile

* add warn

* fix KEY_HAS_MINICPMV_PROJ

* move load_image_size into clip_ctx

* remove the extern "C", MINICPMV_API

* fix uhd code for review comment

* delete minicpmv-wrapper in pr

* remove uhd_image_embed

* Modify 2 notes

* support minicpmv2.6

* modify convert script of minicpmv

* modify convert

* modify convert

* add readme

* add resampler of v2.6

* modify clip

* modify readme

* fix type-check

* fix type-check

* fix type-check

* fix type-check

* modify convert script and readme

* fix convert script and readme

* fix convert

* fix num in convert

* fix type-check

---------

Co-authored-by: Hongji Zhu <fireyoucan@gmail.com>
Co-authored-by: harvestingmoon <leewenyeong@gmail.com>
* server : refactor middleware and /health endpoint

* move "fail_on_no_slot" to /slots

* Update examples/server/server.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* fix server tests

* fix CI

* update server docs

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Add printing to check weights match torch version

* minor code style changes

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
ggerganov and others added 4 commits August 18, 2024 07:43
Add more checks to prevent the RPC server from crashing if invalid input
is received from a client
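
A minimal sketch of the kind of defensive check this implies (the function and message layout here are hypothetical illustrations, not the actual RPC protocol): validate a client-supplied length field before trusting it.

```cpp
// Hypothetical example of validating client input before use, so malformed
// messages are rejected instead of crashing the server.
#include <cstdint>
#include <cstring>
#include <vector>

static bool read_payload(const std::vector<uint8_t> & msg, std::vector<uint8_t> & out) {
    if (msg.size() < sizeof(uint64_t)) {
        return false; // too short to contain the length prefix
    }
    uint64_t len = 0;
    std::memcpy(&len, msg.data(), sizeof(len));
    if (len > msg.size() - sizeof(uint64_t)) {
        return false; // declared length exceeds what was actually sent
    }
    out.assign(msg.begin() + sizeof(uint64_t), msg.begin() + sizeof(uint64_t) + len);
    return true;
}
```
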
Co-authored-by: xuedinge233 <damow890@gmail.com>
Co-authored-by: hipudding <huafengchun@gmail.com>
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Aug 19, 2024
fairydreaming and others added 2 commits August 20, 2024 12:09
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
* sycl: fix im2col overflow and sync with cuda

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* sycl: fix convert overflow

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* sycl: fix convert and dequantize

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* sycl: fix ib in dmmv

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* sycl: refine convert

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* sycl: move downsample global_range into common

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* test: add im2col and convert test cases

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* test: make new cases only in sycl

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* test: comment new test_cases for only local testing

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

---------

Signed-off-by: zhentaoyu <zhentao.yu@intel.com>
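
The overflow class addressed here is the usual one for large tensors: flattened offsets computed in 32-bit arithmetic can exceed INT32_MAX. A small illustrative sketch (not the SYCL kernel itself) of keeping the index math in 64-bit:

```cpp
#include <cstdint>

// Illustrative only: compute a flattened NCHW offset in 64-bit so that large
// tensors (offset > INT32_MAX) do not overflow before indexing.
static inline int64_t nchw_offset(int64_t n, int64_t c, int64_t h, int64_t w,
                                  int64_t C, int64_t H, int64_t W) {
    return ((n * C + c) * H + h) * W + w;
}

// usage: const float v = src[nchw_offset(n, c, y, x, C, H, W)];
```
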
@github-actions github-actions bot added the SYCL label Aug 20, 2024
airMeng and others added 8 commits August 20, 2024 23:50
* fallback mmvq to mul_mat

* mmvq in cuda path

* Update ggml/src/ggml-sycl.cpp

Co-authored-by: Alberto Cabrera Pérez <alberto.cabrera@codeplay.com>

---------

Co-authored-by: Alberto Cabrera Pérez <alberto.cabrera@codeplay.com>
…LAVA CLIP model. (#8984)

* llava: Add ACC OP for GPU acceleration to the Vulkan backend in the LLAVA CLIP model.

- The CLIP model now prioritizes the Vulkan backend over the CPU when Vulkan is available.
- A GGML_OP_ACC shader has been added.
- The encoding performance of the CLIP model improved from 4.2s on the CPU to 0.9s on the GPU.

Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>

* fix-up coding style.

Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>

* Fix-up the missing initial parameter to resolve the compilation warning.

Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>

* [fix] Add missing parameters.

Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>

* [fix] Use nb1 and nb2 for dst.

Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>

* Fix result checks for the ggml_acc call

---------

Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>
Co-authored-by: 0cc4m <picard12@live.de>
* llama : std::move llm_bigram_bpe from work_queue

This commit updates the retrieval of llm_bigram_bpe objects from
work_queue.top() by using std::move.

The motivation for this is to avoid the copying of the std::string
`text` member of the llm_bigram_bpe struct.

* squash! llama : std::move llm_bigram_bpe from work_queue

Introduced a MovablePriorityQueue class to allow moving elements
out of the priority queue for llm_bigram_bpe.

* squash! llama : std::move llm_bigram_bpe from work_queue

Rename MovablePriorityQueue to lama_priority_queue.

* squash! llama : std::move llm_bigram_bpe from work_queue

Rename lama_priority_queue -> llama_priority_queue.
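
The trick described above relies on std::priority_queue exposing its underlying container to subclasses. A minimal sketch of the idea (names are illustrative, not the exact llama.cpp code):

```cpp
#include <algorithm>
#include <queue>
#include <vector>

// Sketch: a priority_queue subclass whose pop_move() moves the top element out
// instead of copying it, avoiding a copy of heavy members (e.g. std::string).
template <typename T, typename Container = std::vector<T>, typename Compare = std::less<T>>
struct movable_priority_queue : std::priority_queue<T, Container, Compare> {
    T pop_move() {
        // move the largest element to the back of the underlying container `c`,
        // then move it out and shrink the container
        std::pop_heap(this->c.begin(), this->c.end(), this->comp);
        T top = std::move(this->c.back());
        this->c.pop_back();
        return top;
    }
};
```
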
…ialization 908)

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
* feat: initial support for llama.cpp

* fix: lint

* refactor: better refactor

* Update src/llama.cpp

Co-authored-by: compilade <git@compilade.net>

* Update src/llama.cpp

Co-authored-by: compilade <git@compilade.net>

* fix: address comments

* Update convert_hf_to_gguf.py

Co-authored-by: compilade <git@compilade.net>

* fix: add more cleanup and harmonization

* fix: lint

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <git@compilade.net>

* fix: change name

* Apply suggestions from code review

Co-authored-by: compilade <git@compilade.net>

* add in operator

* fix: add `dt_b_c_rms` in `llm_load_print_meta`

* fix: correct printf format for bool

* fix: correct print format

* Update src/llama.cpp

Co-authored-by: compilade <git@compilade.net>

* llama : quantize more Mamba tensors

* llama : use f16 as the fallback of fallback quant types

---------

Co-authored-by: compilade <git@compilade.net>
* server : support reading arguments from environment variables

* add -fa and -dt

* readme : specify non-arg env var
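
A hedged sketch of the fallback order such a feature implies (the helper name and the LLAMA_ARG_MODEL variable shown here are assumptions for illustration): an explicit command-line value wins, otherwise an environment variable is consulted, otherwise a default is used.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper (not the actual server code): resolve an option from the
// command line first, then from an environment variable, then from a default.
static std::string resolve_option(const char * cli_value, const char * env_name, const char * def) {
    if (cli_value != nullptr && cli_value[0] != '\0') {
        return cli_value;
    }
    if (const char * env = std::getenv(env_name)) {
        return env;
    }
    return def;
}

// e.g. resolve_option(parsed_model_path, "LLAMA_ARG_MODEL", "models/7B/ggml-model.gguf");
```
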
* llama : advanced batch splits

This includes equal-sequence-length batch splits which are useful
to simplify recurrent model operators.

* llama : always make recurrent state slots contiguous

* ggml : simplify mamba operators

* llama : fix integer signedness mixing

* llama : logits_all has priority over batch->logits

Otherwise, the server embeddings tests failed.
This was likely an existing problem but was only detected here
because of an additional assertion.

* llama : apply suggestions

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* llama : fix t5 segfault

* llama : fix Mamba session save and restore

* llama : minor cosmetic changes

* llama : rename llama_reorder_outputs to llama_output_reorder

Also move it closer to llama_output_reserve.

* llama : fix pooled embeddings when using batches with equal_seqs

* minor : add struct members for clarity

ggml-ci

* llama : fix T5 segfault again

* llama : fix Mamba pooled embeddings with multiple sequences

Until the pooled embeddings are refactored to allow splitting
across ubatches for causal embeddings,
recurrent models can only process a single sequence per ubatch
when calculating pooled embeddings.

* llama : add llama_model_is_recurrent to simplify figuring that out

This will make it easier to more cleanly support RWKV-v6 and Mamba-2.

* llama : fix simple splits when the batch contains embeddings

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
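
A conceptual sketch of what an equal-sequence-length split means (an illustration of the idea only, not the llama.cpp batch code): sequences that contribute the same number of tokens are grouped into the same ubatch, so recurrent operators can treat each ubatch as a dense [n_seqs, n_tokens_per_seq] block.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Illustrative only: group sequences by token count so that every sequence in
// a given "ubatch" has the same length.
struct seq_tokens {
    int32_t              seq_id;
    std::vector<int32_t> tokens;
};

static std::vector<std::vector<seq_tokens>> split_equal_seqs(const std::vector<seq_tokens> & batch) {
    std::map<size_t, std::vector<seq_tokens>> by_len;
    for (const auto & s : batch) {
        by_len[s.tokens.size()].push_back(s);
    }
    std::vector<std::vector<seq_tokens>> ubatches;
    for (auto & kv : by_len) {
        ubatches.push_back(std::move(kv.second));
    }
    return ubatches;
}
```
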
* add onednn

* add sycl_f16

* add dnnl stream

* add engine map

* use dnnl for intel only

* use fp16fp16fp16

* update doc
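
For context, a minimal hedged sketch of what an f16 × f16 → f16 matmul through oneDNN (DNNL) looks like in the v3 C++ API; the shapes and setup here are assumptions for illustration, not the ggml-sycl integration itself.

```cpp
#include <oneapi/dnnl/dnnl.hpp>
#include <unordered_map>

// Illustrative sketch: run an f16 x f16 -> f16 matmul through oneDNN on a GPU
// engine, using a dnnl::stream for execution.
void dnnl_f16_matmul_example(int64_t M, int64_t N, int64_t K) {
    using namespace dnnl;

    engine eng(engine::kind::gpu, 0);
    stream strm(eng);

    memory::desc a_md({M, K}, memory::data_type::f16, memory::format_tag::ab);
    memory::desc b_md({K, N}, memory::data_type::f16, memory::format_tag::ab);
    memory::desc c_md({M, N}, memory::data_type::f16, memory::format_tag::ab);

    memory a_mem(a_md, eng), b_mem(b_md, eng), c_mem(c_md, eng);

    matmul::primitive_desc pd(eng, a_md, b_md, c_md);
    matmul prim(pd);

    prim.execute(strm, {{DNNL_ARG_SRC, a_mem}, {DNNL_ARG_WEIGHTS, b_mem}, {DNNL_ARG_DST, c_mem}});
    strm.wait();
}
```
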