forked from ggerganov/llama.cpp
[pull] master from ggerganov:master #142
Closed
* server : fix segfault on long system prompt
* server : fix parallel generation with very small batch sizes
* server : fix typo in comment

* Optimize Vulkan REPEAT performance
* Use Vulkan GLSL fused multiply-add instruction where possible
* Add GGML_VULKAN_PERF option to output performance data per operator
* Rework and fix Vulkan descriptor set and descriptor pool handling
* Fix float32 concat f16 shader validation error
* Add Vulkan GROUP_NORM eps parameter
* Fix validation error with transfer queue memory barrier flags
* Remove trailing whitespaces
Signed-off-by: Jiri Podivin <jpodivin@redhat.com>
* retrieval
* Reuse querybatch to reduce frequent memory allocation
* delete unused white space

* ggml : Dynamic ggml_sched_max_splits based on graph_size
* Fixed and readded debug code for causes

* Add nemotron GGUF conversion & inference support
* Fix formatting issues
* Remove unnecessary write_tensors()
* Update convert_hf_to_gguf.py
  Co-authored-by: compilade <git@compilade.net>
* Update src/llama.cpp
  Co-authored-by: compilade <git@compilade.net>
* Address comments by @compilade
* Replace ggml_mul_mat()->llm_build_lora_mm()
* Remove mutable variable
* Use for bias tensors
* Cover corner case for role_scaling not in config.json
---------
Co-authored-by: compilade <git@compilade.net>

* Add support for cpu_get_num_physical_cores() on Windows
* fix build bug on msys2-clang64 and ucrt64
* avoid adding new function
* add new macros to avoid windows+mingw64
* Add error checking to return default value

* add exaone model support
* add chat template
* fix whitespace
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* add ftype
* add exaone pre-tokenizer in `llama-vocab.cpp`
  Co-Authored-By: compilade <113953597+compilade@users.noreply.github.com>
* fix lint
  Co-Authored-By: compilade <113953597+compilade@users.noreply.github.com>
* add `EXAONE` to supported models in `README.md`
* fix space
  Co-authored-by: compilade <git@compilade.net>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: compilade <113953597+compilade@users.noreply.github.com>
Co-authored-by: compilade <git@compilade.net>
Signed-off-by: Aisuko <urakiny@gmail.com>
Co-authored-by: farbod <farbod.bjary82@gmail.com>
* init
* rename
* add run android for termux in readme
* add android readme
* add instructions in readme
* change name in readme
* Update README.md
* fixed line
* add result in readme
* random pos_embed
* add positions index
* change for ollama
* change for ollama
* better pos_embed in clip
* support ollama
* update cmakelist
* update cmakelist
* rename wrapper
* clear code
* replace and organize code
* add link
* sync master
* fix warnings
* fix warnings
* fix bug in bicubic resize when the image needs to be resized smaller
* receive review comments and modify
* receive review comments and modify
* put all code into llava dir
* fix quality problem in pr code
* change n_layer
* add space in "-1"
* imitate reshape bug of python code
* fix bug in clip
* fix issues for merging
* fix llama-minicpmv-cli in cmake file
* change pr readme
* fix code review
* remove in line 33 directory in the /cmakelists.txt (not in example, in the main dir)
* fix cmakefile
* add warn
* fix KEY_HAS_MINICPMV_PROJ
* remove load_image_size into clip_ctx
* remove the extern "C", MINICPMV_API
* fix uhd code for review comment
* delete minicpmv-wrapper in pr
* remove uhd_image_embed
* Modify 2 notes
* support minicpmv2.6
* modify convert script of minicpmv
* modify convert
* modify convert
* add readme
* add resampler of v2.6
* modify clip
* modify readme
* fix type-check
* fix type-check
* fix type-check
* fix type-check
* modify convert script and readme
* fix convert script and readme
* fix convert
* fix num in convert
* fix type-check
---------
Co-authored-by: Hongji Zhu <fireyoucan@gmail.com>
Co-authored-by: harvestingmoon <leewenyeong@gmail.com>

* server : refactor middleware and /health endpoint
* move "fail_on_no_slot" to /slots
* Update examples/server/server.cpp
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* fix server tests
* fix CI
* update server docs
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Add printing to check weights match torch version
* minor code style changes
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Add more checks that prevent the RPC server from crashing when it receives invalid input from a client
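The commit message above does not list the individual checks. As a minimal sketch of the kind of validation it describes, assuming a hypothetical request that asks the server to copy `size` bytes starting at `offset` out of a server-side buffer, a bounds check that cannot itself overflow might look like this. The names `range_in_bounds`, `copy_request`, and `handle_copy` are illustrative and are not taken from the RPC backend.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// True when [offset, offset + size) lies inside a buffer of `capacity` bytes.
// Written as two comparisons so that offset + size is never computed and
// therefore cannot wrap around for hostile 64-bit values.
bool range_in_bounds(uint64_t offset, uint64_t size, uint64_t capacity) {
    return offset <= capacity && size <= capacity - offset;
}

// Hypothetical client request: both fields come from the network and are untrusted.
struct copy_request {
    uint64_t offset;
    uint64_t size;
};

// Rejects an invalid request by returning false instead of reading out of bounds.
bool handle_copy(const std::vector<uint8_t> & buffer,
                 const copy_request & req,
                 std::vector<uint8_t> & out) {
    if (!range_in_bounds(req.offset, req.size, buffer.size())) {
        return false;
    }
    const auto first = buffer.begin() + static_cast<std::ptrdiff_t>(req.offset);
    out.assign(first, first + static_cast<std::ptrdiff_t>(req.size));
    return true;
}
```

The point of the two-step comparison is that a naive `offset + size <= capacity` passes for values such as `offset = UINT64_MAX, size = 2`, because the sum wraps to 1.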
Co-authored-by: xuedinge233 <damow890@gmail.com> Co-authored-by: hipudding <huafengchun@gmail.com>
github-actions bot added the documentation label (Improvements or additions to documentation) on Aug 19, 2024
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
* sycl: fix im2col overflow and sync with cuda
  Signed-off-by: zhentaoyu <zhentao.yu@intel.com>
* sycl: fix convert overflow
  Signed-off-by: zhentaoyu <zhentao.yu@intel.com>
* sycl: fix convert and dequantize
  Signed-off-by: zhentaoyu <zhentao.yu@intel.com>
* sycl: fix ib in dmmv
  Signed-off-by: zhentaoyu <zhentao.yu@intel.com>
* sycl:refine convert
  Signed-off-by: zhentaoyu <zhentao.yu@intel.com>
* sycl: move downsample global_range into common
  Signed-off-by: zhentaoyu <zhentao.yu@intel.com>
* test: add im2col and convert test cases
  Signed-off-by: zhentaoyu <zhentao.yu@intel.com>
* test: make new cases only in sycl
  Signed-off-by: zhentaoyu <zhentao.yu@intel.com>
* test: comment new test_cases for only local testing
  Signed-off-by: zhentaoyu <zhentao.yu@intel.com>
---------
Signed-off-by: zhentaoyu <zhentao.yu@intel.com>

* fallback mmvq to mul_mat
* mmvq in cuda path
* Update ggml/src/ggml-sycl.cpp
  Co-authored-by: Alberto Cabrera Pérez <alberto.cabrera@codeplay.com>
---------
Co-authored-by: Alberto Cabrera Pérez <alberto.cabrera@codeplay.com>

…LAVA CLIP model. (#8984)
* llava: Add ACC OP for GPU acceleration to the Vulkan backend in the LLAVA CLIP model.
  - The CLIP model now prioritizes the Vulkan backend over the CPU when Vulkan is available.
  - A GGML_OP_ACC shader has been added.
  - The encoding performance of the CLIP model improved from 4.2s on the CPU to 0.9s on the GPU.
  Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>
* fix-up coding style.
  Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>
* Fix-up the missing initial parameter to resolve the compilation warning.
  Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>
* [fix] Add missing parameters.
  Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>
* [fix] Use nb1 and nb2 for dst.
  Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>
* Fix check results ggml_acc call
---------
Signed-off-by: Changyeon Kim <cyzero.kim@samsung.com>
Co-authored-by: 0cc4m <picard12@live.de>

* llama : std::move llm_bigram_bpe from work_queue
  This commit updates the retrieval of llm_bigram_bpe objects from work_queue.top() by using std::move. The motivation is to avoid copying the std::string `text` member of the llm_bigram_bpe struct.
* squash! llama : std::move llm_bigram_bpe from work_queue
  Introduced a MovablePriorityQueue class to allow moving elements out of the priority queue for llm_bigram_bpe.
* squash! llama : std::move llm_bigram_bpe from work_queue
  Rename MovablePriorityQueue to lama_priority_queue.
* squash! llama : std::move llm_bigram_bpe from work_queue
  Rename lama_priority_queue -> llama_priority_queue.
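`std::priority_queue::top()` returns a const reference, so taking the top element normally copies it, including any `std::string` member. The sketch below shows one way a priority queue can hand out its top element by move, which is the idea the commit message describes. It is not the code from the commit: `movable_priority_queue`, `pop_move`, `bigram`, and `bigram_compare` are illustrative names, and `bigram` is a reduced stand-in for `llm_bigram_bpe`.

```cpp
#include <algorithm>
#include <cassert>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Reduced stand-in for llm_bigram_bpe (assumption: only these two fields matter here).
struct bigram {
    std::string text;
    int         rank;
};

// Lower rank means higher priority, so "greater rank" sorts towards the bottom.
struct bigram_compare {
    bool operator()(const bigram & a, const bigram & b) const {
        return a.rank > b.rank;
    }
};

// std::priority_queue exposes its container `c` and comparator `comp` as
// protected members, so a derived class can move the top element out.
template <typename T, typename Container = std::vector<T>, typename Compare = std::less<T>>
class movable_priority_queue : public std::priority_queue<T, Container, Compare> {
public:
    using std::priority_queue<T, Container, Compare>::priority_queue;

    // Removes the top element and returns it by value without copying it.
    T pop_move() {
        assert(!this->c.empty());
        // pop_heap moves the top element to the back and restores the heap
        // property on the remaining range, so the move below cannot disturb it.
        std::pop_heap(this->c.begin(), this->c.end(), this->comp);
        T item = std::move(this->c.back());
        this->c.pop_back();
        return item;
    }
};
```

A caller would replace the usual `auto x = q.top(); q.pop();` pair with `auto x = q.pop_move();`.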
…ialization 908) Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
* feat: initial support for llama.cpp
* fix: lint
* refactor: better refactor
* Update src/llama.cpp
  Co-authored-by: compilade <git@compilade.net>
* Update src/llama.cpp
  Co-authored-by: compilade <git@compilade.net>
* fix: address comments
* Update convert_hf_to_gguf.py
  Co-authored-by: compilade <git@compilade.net>
* fix: add more cleanup and harmonization
* fix: lint
* Update gguf-py/gguf/gguf_writer.py
  Co-authored-by: compilade <git@compilade.net>
* fix: change name
* Apply suggestions from code review
  Co-authored-by: compilade <git@compilade.net>
* add in operator
* fix: add `dt_b_c_rms` in `llm_load_print_meta`
* fix: correct printf format for bool
* fix: correct print format
* Update src/llama.cpp
  Co-authored-by: compilade <git@compilade.net>
* llama : quantize more Mamba tensors
* llama : use f16 as the fallback of fallback quant types
---------
Co-authored-by: compilade <git@compilade.net>

* server : support reading arguments from environment variables
* add -fa and -dt
* readme : specify non-arg env var
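The commit messages above do not name the environment variables or say how they interact with command-line flags. As a rough sketch of the general pattern only, a setting can be resolved from the environment with a fallback default as shown below. The helpers `env_or` and `env_or_int` are hypothetical, as is any variable name passed to them; none of this is taken from the server code.

```cpp
#include <cassert>
#include <climits>
#include <cstdlib>
#include <string>

// Returns the value of environment variable `name`, or `fallback` when it is unset.
std::string env_or(const char * name, const std::string & fallback) {
    assert(name != nullptr);
    const char * value = std::getenv(name);
    return value ? std::string(value) : fallback;
}

// Integer variant: an unset, empty, non-numeric, or out-of-range value
// yields `fallback` rather than a partially parsed number.
int env_or_int(const char * name, int fallback) {
    assert(name != nullptr);
    const char * value = std::getenv(name);
    if (value == nullptr || *value == '\0') {
        return fallback;
    }
    char * end = nullptr;
    const long parsed = std::strtol(value, &end, 10);
    if (*end != '\0' || parsed < INT_MIN || parsed > INT_MAX) {
        return fallback;
    }
    return static_cast<int>(parsed);
}
```

For example, `env_or_int("EXAMPLE_SERVER_CTX_SIZE", 512)` (a made-up variable name) returns 512 unless the variable holds a valid integer.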
* llama : advanced batch splits
  This includes equal-sequence-length batch splits which are useful to simplify recurrent model operators.
* llama : always make recurrent state slots contiguous
* ggml : simplify mamba operators
* llama : fix integer signedness mixing
* llama : logits_all has priority over batch->logits
  Otherwise, the server embeddings tests failed. This was likely an existing problem but was only detected here because of an additional assertion.
* llama : apply suggestions
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* llama : fix t5 segfault
* llama : fix Mamba session save and restore
* llama : minor cosmetic changes
* llama : rename llama_reorder_outputs to llama_output_reorder
  Also move it closer to llama_output_reserve.
* llama : fix pooled embeddings when using batches with equal_seqs
* minor : add struct members for clarity
  ggml-ci
* llama : fix T5 segfault again
* llama : fix Mamba pooled embeddings with multiple sequences
  Until the pooled embeddings are refactored to allow splitting across ubatches for causal embeddings, recurrent models can only process a single sequence per ubatch when calculating pooled embeddings.
* llama : add llama_model_is_recurrent to simplify figuring that out
  This will make it easier to more cleanly support RWKV-v6 and Mamba-2.
* llama : fix simple splits when the batch contains embeddings
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* add onednn
* add sycl_f16
* add dnnl stream
* add engine map
* use dnnl for intel only
* use fp16fp16fp16
* update doc
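To illustrate what an equal-sequence-length split means, here is a small sketch under stated assumptions: each split takes the same number of tokens from every sequence that takes part in it, and the total per split stays within a budget `n_ubatch`. This is not the algorithm from the commit, whose details the message does not give. The names `equal_split` and `split_equal` and the greedy policy are invented for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// One split: every listed sequence contributes exactly n_seq_tokens tokens.
struct equal_split {
    std::vector<int> seq_ids;
    int              n_seq_tokens;
};

// `remaining[i]` is the number of tokens sequence i still has to process.
// Greedy policy (an assumption of this sketch): take all sequences that still
// have tokens, up to n_ubatch of them, and cut each to the same length.
std::vector<equal_split> split_equal(std::vector<int> remaining, int n_ubatch) {
    assert(n_ubatch > 0);
    std::vector<equal_split> splits;
    for (;;) {
        equal_split s{{}, 0};
        int shortest = 0;
        for (std::size_t i = 0; i < remaining.size(); ++i) {
            if (remaining[i] <= 0) {
                continue;
            }
            if (static_cast<int>(s.seq_ids.size()) == n_ubatch) {
                break; // each sequence needs room for at least one token
            }
            shortest = s.seq_ids.empty() ? remaining[i] : std::min(shortest, remaining[i]);
            s.seq_ids.push_back(static_cast<int>(i));
        }
        if (s.seq_ids.empty()) {
            break; // every sequence is fully consumed
        }
        // Per-sequence budget is at least 1 because seq_ids.size() <= n_ubatch.
        const int budget = n_ubatch / static_cast<int>(s.seq_ids.size());
        s.n_seq_tokens = std::min(shortest, budget);
        for (int id : s.seq_ids) {
            remaining[static_cast<std::size_t>(id)] -= s.n_seq_tokens;
        }
        splits.push_back(std::move(s));
    }
    return splits;
}
```

Because every sequence in a split has the same length, a recurrent operator can treat the split as a dense block of shape (sequences x tokens) without per-sequence bookkeeping, which is the simplification the commit message refers to.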
See Commits and Changes for more details.
Created by pull[bot]
Can you help keep this open source service alive? 💖 Please sponsor : )