[pull] master from ggerganov:master #139

Closed · wants to merge 36 commits

Conversation


@pull[bot] commented Aug 6, 2024

See Commits and Changes for more details.


Created by pull[bot]

Can you help keep this open source service alive? 💖 Please sponsor : )

ggerganov and others added 29 commits August 3, 2024 19:53
* [example] batched-bench "segmentation fault"

When `llama-batched-bench` is invoked _without_ setting `-npl`, "number
of parallel prompts", it segfaults.

The segfault is caused by invoking `max_element()` on a zero-length
vector, `n_pl`.

This commit addresses that by first checking whether the number of
parallel prompts is zero and, if so, setting the maximum sequence size
to 1; otherwise it is set to the result of `max_element()` as before
(sketched below).

This fixes the following crash, seen when running `lldb build/bin/llama-batched-bench -- -m models/Meta-Llama-3-8B.gguf`:

```
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x000000010000366c llama-batched-bench`main(argc=3, argv=0x000000016fdff268) at batched-bench.cpp:72:28
   69  	    llama_context_params ctx_params = llama_context_params_from_gpt_params(params);
   70
   71  	    // ensure enough sequences are available
-> 72  	    ctx_params.n_seq_max = *std::max_element(n_pl.begin(), n_pl.end());
```

* Update examples/batched-bench/batched-bench.cpp

Co-authored-by: compilade <git@compilade.net>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: compilade <git@compilade.net>
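
For reference, a minimal sketch of the guard described above. Only `n_pl` and `n_seq_max` come from the commit message and stack trace; the surrounding scaffolding is illustrative, not the actual batched-bench.cpp code:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    // n_pl holds the numbers of parallel prompts parsed from -npl;
    // it is empty when the flag is not passed, as in the crashing run.
    std::vector<int32_t> n_pl;

    // Before the fix, *std::max_element(n_pl.begin(), n_pl.end())
    // dereferenced end() of an empty vector and segfaulted.
    // After the fix, an empty n_pl falls back to a sequence size of 1.
    const uint32_t n_seq_max = n_pl.empty()
        ? 1
        : *std::max_element(n_pl.begin(), n_pl.end());

    printf("n_seq_max = %u\n", n_seq_max);
    return 0;
}
```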
* Don't ignore llama.cpp params

* Add fallback for max_tokens
This commit moves the comment for the c parameter from ggml_rope to
ggml_rope_ext. The comment is currently incorrect as ggml_rope does not
have a c parameter (freq_factors tensor).

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* Fix Vulkan repeat op

* Implement Vulkan concat op

* Delete old Vulkan shader generator

* Implement Vulkan im2col op

* Implement Vulkan unary gelu_quick op

* Implement Vulkan group_norm op

* Implement Vulkan timestep_embedding op

* Implement Vulkan upscale op

* Fix Vulkan vk_context tensor extra index issue

* Fix Vulkan matmul shader parameter bug

* Properly fix Vulkan matmul shader parameter bug

* Add Vulkan ADD f16 + f32 -> f16 operator support

* Implement Vulkan tanh op

* Fix Vulkan group count too large Validation error on non-Nvidia GPUs

* Throw error when too much memory is requested

* Fix another Vulkan group count too large Validation error on non-Nvidia GPUs

* Fix matmul MMQ condition

* Implement Vulkan pad op

* Fix Vulkan crash when tensor is used multiple times in a compute graph

* Add Vulkan CONCAT f16 + f16 -> f16 op

* Add Vulkan LEAKY_RELU op
ggml-ci
* Fix Vulkan mul mat vec invalid results when ncols < warp size

* Only run backend ops mul mat vec block size test if block size not already covered
* Vulkan-shaders: attempt to fix compilation on Windows

* fix mismatched parenthesis
… Llama 3.1 tool call support (#8858)

* gguf-py, llama : add constants and methods related to Llama-3.1 <|eom_id|> token

* llama : find Llama-3.1 <|eom_id|> token id during vocab loading

* llama-vocab : add Llama-3.1 <|eom_id|> token to the set of tokens stopping the generation

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
* py: add more authorship metadata from model card

* fixup! py: add more authorship metadata from model card
It's helpful to use expm1f(x), because expf(x)-1 will result in overflow
for 25% of single-precision floating point numbers.
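
A minimal demonstration of the numerical difference near zero (illustrative C++ only, not the actual ggml call site):

```cpp
#include <cmath>
#include <cstdio>

int main() {
    // For tiny x, expf(x) rounds to exactly 1.0f, so subtracting 1
    // destroys the result; expm1f(x) computes e^x - 1 directly.
    const float x = 1e-10f;
    printf("expf(x) - 1 = %g\n", (double)(expf(x) - 1.0f)); // prints 0
    printf("expm1f(x)   = %g\n", (double)expm1f(x));        // ~1e-10
    return 0;
}
```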
ramalama is a repo-agnostic, boring CLI tool that supports pulling
models from Ollama, Hugging Face, and OCI registries.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
* common : Changed tuple to struct (TODO fix)

Use struct `llama_init_result` to replace the previous
std::tuple<struct llama_model *, struct llama_context *>

* delete llama_init_default_params()

* delete the extra whitespace
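
A sketch of the shape of that change. The struct name and the replaced tuple type come from the commit message; the member names below are assumptions for illustration:

```cpp
struct llama_model;   // opaque llama.cpp types, declared in llama.h
struct llama_context;

// Named members replace std::get<0>() / std::get<1>() at call sites,
// making the meaning of the two pointers self-documenting.
struct llama_init_result {
    llama_model *   model   = nullptr;
    llama_context * context = nullptr;
};
```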
* cann: fix ggml_backend_cann_buffer_get_tensor

 1. fix the data pointer offset
 2. allow reading partial tensors

* fix backend cann set_tensor
* add conversion for bge-m3; small fix in unigram tokenizer

* clean up and simplify XLMRoberta conversion
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
…e31a4f6` (#8880)

* Fix compilation issue in `vulkan-shaders-gen`

e31a4f6 broke compilation on w64devkit. Including `algorithm` seems to fix that.

* Guard it under `#ifdef _WIN32`
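
A sketch of the described fix (the exact placement within vulkan-shaders-gen.cpp may differ):

```cpp
// w64devkit's toolchain apparently does not pull in <algorithm>
// transitively, so include it explicitly, guarded to Windows builds.
#ifdef _WIN32
#include <algorithm>
#endif
```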
When using CMake to build with Vulkan support, compiling
vulkan-shaders-gen fails due to missing a CMakeLists.txt specification
to link vulkan-shaders-gen with the threading library, resulting in the
following error.

    [5/172] Linking CXX executable bin/vulkan-shaders-gen
    FAILED: bin/vulkan-shaders-gen
    : && /usr/bin/c++ ggml/src/vulkan-shaders/CMakeFiles/vulkan-shaders-gen.dir/vulkan-shaders-gen.cpp.o -o bin/vulkan-shaders-gen   && :
    ld: error: undefined symbol: pthread_create
    >>> referenced by vulkan-shaders-gen.cpp
    >>>               ggml/src/vulkan-shaders/CMakeFiles/vulkan-shaders-gen.dir/vulkan-shaders-gen.cpp.o:(std::__1::__libcpp_thread_create[abi:se180100](pthread**,
    >>>               void* (*)(void*), void*))
    c++: error: linker command failed with exit code 1 (use -v to see invocation)
    [6/172] Generating build details from Git
    -- Found Git: /usr/local/bin/git (found version "2.45.2")
    ninja: build stopped: subcommand failed.

Add the CMakeLists.txt specification to link vulkan-shaders-gen with the
threading library and fix the above error.

Fixes #8834
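
A sketch of the kind of CMakeLists.txt addition described, using CMake's standard `Threads` package (the exact lines in the repository may differ):

```cmake
# Resolve pthread_create by linking the platform threading library.
find_package(Threads REQUIRED)
target_link_libraries(vulkan-shaders-gen PUBLIC Threads::Threads)
```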
This commit updates the name of the executable in README.md from
`simple` to `llama-simple`.
* server : add lora hotswap endpoint

* handle lora_no_apply

* fix build

* update docs

* clean up struct def

* fix build

* add LoRA test

* fix style
@pull[bot] added the ⤵️ pull label Aug 6, 2024
Nexesenex and others added 7 commits August 7, 2024 01:41
This commit updates the usage comment in quantize.cpp to reflect the
new name of the executable, which is llama-quantize.
* Add support for getting cpu info on Windows for llama_bench

* refactor

---------

Co-authored-by: slaren <slarengh@gmail.com>
* Updated device filter to depend on default_selector (fixes non-intel device issues)
* Small related update to example/sycl Readme
* ggml-backend : fix async copy from CPU

* cuda : more reliable async copy, fix stream used when the devices are the same
* make : use C compiler to build metal embed object

* use rm + rmdir to avoid -r flag in rm