Releases · ggerganov/llama.cpp
b4103
llama/ex: remove --logdir argument (#10339)
b4102
llamafile : fix include path (#0) ggml-ci
b4100
server: (web UI) Add samplers sequence customization (#10255)
* Samplers sequence: simplified, with an input field.
* Removed an unused function.
* Modified and used `settings-modal-short-input`.
* Renamed "name" --> "label".
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
b4098
vulkan: Optimize some mat-vec mul quant shaders (#10296)
Compute two result elements per workgroup (for Q{4,5}_{0,1}). This reuses the B loads across the rows and also reuses some addressing calculations. It required manually, partially unrolling the loop, since the compiler is less willing to unroll outer loops.
Add bounds-checking on the last iteration of the loop; I think this was at least partly broken before.
Also optimize the Q4_K shader to vectorize most loads and reduce the number of bit-twiddling instructions.
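The actual change lives in a Vulkan GLSL compute shader; as a rough illustration of the idea only, here is a plain C++ sketch (not the shader itself) of producing two result elements per "workgroup" so each load of the shared vector B is reused across two rows, with the main loop manually unrolled and the final partial block bounds-checked. All names (`matvec_two_rows`, `BLOCK`, `A`, `B`, `y`) are illustrative assumptions, not identifiers from the commit.

```cpp
#include <cstddef>

// Sketch: one "workgroup" computes two output rows so each load of B[c]
// feeds both accumulators; only the tail loop needs bounds checks.
void matvec_two_rows(const float * A, const float * B, float * y,
                     size_t rows, size_t cols) {
    const size_t BLOCK = 4;                      // illustrative unroll factor
    for (size_t r = 0; r + 1 < rows; r += 2) {
        const float * a0 = A + (r + 0) * cols;
        const float * a1 = A + (r + 1) * cols;
        float acc0 = 0.0f, acc1 = 0.0f;
        size_t c = 0;
        // main loop: manually unrolled, no bounds checks needed
        for (; c + BLOCK <= cols; c += BLOCK) {
            for (size_t k = 0; k < BLOCK; ++k) {
                const float b = B[c + k];        // one load of B feeds both rows
                acc0 += a0[c + k] * b;
                acc1 += a1[c + k] * b;
            }
        }
        // last iteration: bounds-checked tail
        for (; c < cols; ++c) {
            const float b = B[c];
            acc0 += a0[c] * b;
            acc1 += a1[c] * b;
        }
        y[r + 0] = acc0;
        y[r + 1] = acc1;
    }
    if (rows % 2) {                              // odd trailing row
        const float * a = A + (rows - 1) * cols;
        float acc = 0.0f;
        for (size_t c = 0; c < cols; ++c) acc += a[c] * B[c];
        y[rows - 1] = acc;
    }
}
```

The trade-off being sketched: halving the number of B loads and address computations per output element, at the cost of more registers per workgroup and a hand-unrolled loop.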
b4095
llama : save the number of parameters and the model size in llama_model (#10286)
Fixes #10285.
b4094
Make updates to fix issues with clang-cl builds while using AVX512 fl…
b4092
ggml : fix some build issues
b4091
cmake : fix ppc64 check (whisper/0) ggml-ci
b4088
AVX BF16 and single-scale quant optimizations (#10212)
* Use 128-bit loads (I've tried 256->128 to death and it's slower).
* Double accumulator.
* AVX BF16 vec dot.
* +3% Q4_0 inference.
* +7% tg, +5% pp compared to master.
* Slower F16C version, kept for reference.
* 256-bit version, also slow (I tried).
* Revert F16.
* Faster with madd.
* Split into functions.
* Q8_0 and IQ4_NL: 5-7% faster.
* Fix potential overflow (performance reduced).
* 16-bit add for Q4_0 only.
* Merge.
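To make the "128-bit loads" and "double accumulator" bullets concrete, here is a hedged C++ intrinsics sketch of a BF16 vec dot; it is not the committed ggml kernel, `dot_bf16` and `bf16_to_f32` are illustrative names, and it assumes AVX2+FMA and a length divisible by 16.

```cpp
#include <immintrin.h>
#include <stdint.h>

// Widen 8 bf16 values to fp32: shift into the high 16 bits of each 32-bit lane,
// which is exactly the fp32 bit pattern of the corresponding bf16 value.
static inline __m256 bf16_to_f32(__m128i x) {
    return _mm256_castsi256_ps(_mm256_slli_epi32(_mm256_cvtepu16_epi32(x), 16));
}

// Sketch only: bf16 dot product with 128-bit loads and two accumulators.
// Build with -mavx2 -mfma; assumes n % 16 == 0.
float dot_bf16(const uint16_t * a, const uint16_t * b, int n) {
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 16) {
        // two 128-bit loads per operand (16 bf16 values)
        __m256 a0 = bf16_to_f32(_mm_loadu_si128((const __m128i *)(a + i)));
        __m256 a1 = bf16_to_f32(_mm_loadu_si128((const __m128i *)(a + i + 8)));
        __m256 b0 = bf16_to_f32(_mm_loadu_si128((const __m128i *)(b + i)));
        __m256 b1 = bf16_to_f32(_mm_loadu_si128((const __m128i *)(b + i + 8)));
        acc0 = _mm256_fmadd_ps(a0, b0, acc0);   // two independent accumulators
        acc1 = _mm256_fmadd_ps(a1, b1, acc1);   // help hide FMA latency
    }
    __m256 acc = _mm256_add_ps(acc0, acc1);
    // horizontal sum of the 8 lanes
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    return _mm_cvtss_f32(s);
}
```

The commit's quantized-path tricks (madd, the 16-bit add for Q4_0) are not shown here; the sketch only covers the general shape of a BF16 dot with 128-bit loads and a second accumulator.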
b4087
ci: build test musa with cmake (#10298)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>