forked from ggml-org/llama.cpp
Fir 1035: Update llama.cpp to latest and add all the new dependencies to the toolchain #74
Merged
Conversation
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
…vices (ggml-org#16156)
* Throw system error on old Vulkan driver rather than SIGABRT
* Optionally handle any potential error in vulkan init
* CUDA: refactor and deduplicate vector FA kernels
…gml-org#16277)
* CUDA: mul_mat_id for mmf for bs <= 64 for f16 and bs <= 32 for f32
  This commit adds mul_mat_id support for ncols_dst >= 16. It does this by packing ncols_dst tiles into the blockDim.y. My tests on an RTX 3090 show that this is faster than the cuBLAS fallback for f16 up to bs=64, and for f32 up to bs=32.
* Review: refactor if statement
…l-org#16224)
* don't use VULKAN_HPP_DEFAULT_DISPATCH_LOADER_DYNAMIC_STORAGE which can cause conflicts if application or other libraries do the same
…6160) The "Clamp" spec constant is already based on whether KV is a multiple of Bc, so use that to control whether bounds checking is performed. Add bounds checking to the scalar and coopmat1 paths. Coopmat2 didn't need any changes (the K/V tensors are already optionally clamped, nothing else needed to be changed).
* vulkan: handle mat_mul with A matrix > 4GB
  This change splits mat_mul operations with huge A matrix into chunks in the M dimension. This works well for stable-diffusion use cases where the im2col matrix has very large M.
  Fix the order of setting the stride in mul_mm_cm2 - setting the dimension clobbers the stride, so stride should be set after.
* build fixes
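The M-dimension chunking described above can be sketched in plain C++. The function name, dense row-major layout, and `max_rows` parameter below are illustrative assumptions, not the actual Vulkan shader path:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Multiply A (M x K) by B (K x N) into C (M x N), splitting the M dimension
// into chunks so that no single dispatch sees more than max_rows rows of A.
// Sketch of the chunking strategy only; the real work happens on the GPU.
void matmul_chunked(const std::vector<float>& A, const std::vector<float>& B,
                    std::vector<float>& C, int64_t M, int64_t K, int64_t N,
                    int64_t max_rows) {
    for (int64_t m0 = 0; m0 < M; m0 += max_rows) {
        const int64_t rows = std::min(max_rows, M - m0);  // last chunk may be short
        // Each chunk is an independent (rows x K) * (K x N) multiply.
        for (int64_t i = 0; i < rows; ++i) {
            for (int64_t j = 0; j < N; ++j) {
                float acc = 0.0f;
                for (int64_t k = 0; k < K; ++k) {
                    acc += A[(m0 + i) * K + k] * B[k * N + j];
                }
                C[(m0 + i) * N + j] = acc;
            }
        }
    }
}
```

Because each chunk only reads its own row range of A, the result is identical to the unchunked multiply while keeping every dispatch under the size limit.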
* metal : fuse non-sequential nodes
* cont : add comment
* cont : simplify bounds checks
* metal : support mul_mm with src1->type == GGML_TYPE_F16
* metal : support mul_mm_id with src1->type == GGML_TYPE_F16 [no ci]
* metal : mul_mm support ne00 % 32 != 0
* metal : support mul_mm_id with ne00 % 32 != 0
* cont : remove unnecessary unrolls
* cont : simplify data loading
* metal : optimize mul_mm when output bounds checks are not needed
* vulkan: 64-bit im2col
  Add variants of the im2col shaders that use buffer_device_address/buffer_reference, and use 64-bit address calculations. This is needed for large convolutions used in stable-diffusion.cpp.
* fix validation error for large im2col
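The motivation for 64-bit address calculations can be shown with a tiny sketch: a flat im2col offset computed in 32-bit arithmetic wraps once the element count passes 2^32, while the same arithmetic in 64 bits does not. The helpers below are illustrative, not the shader code:

```cpp
#include <cassert>
#include <cstdint>

// Flat offset into a row-major buffer. Doing the multiply in int64_t avoids
// the silent wraparound a 32-bit computation would produce for large tensors.
int64_t flat_index64(int64_t row, int64_t col, int64_t row_stride) {
    return row * row_stride + col;
}

// Same computation, deliberately performed in unsigned 32-bit arithmetic,
// to show the wraparound (modulo 2^32) for large convolutions.
uint32_t flat_index32(uint32_t row, uint32_t col, uint32_t row_stride) {
    return row * row_stride + col;
}
```

For a 100000 x 100000 im2col matrix the 32-bit index silently wraps, which is exactly the failure mode the 64-bit shader variants avoid.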
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
…ired (ggml-org#16264)
* common : fix reasoning before forced tool call via tool_choice = required
* common : improve reasoning and commentary handling when tool_choice is required
(cherry picked from commit c746984)
---------
Co-authored-by: Alde Rojas <hello@alde.dev>
…-org#16307)
* fix GGML_F32_VEC_FMA argument order in ggml_vec_mad1_f32
* add test that fails on simd
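The bug class fixed above is easy to reproduce: `fmaf(a, b, c)` computes `a*b + c`, so placing the addend in a multiplicand position changes the result. A scalar sketch of the mad1 operation (the name loosely mirrors the ggml helper; this is not the SIMD code):

```cpp
#include <cassert>
#include <cmath>

// ggml_vec_mad1_f32 computes y[i] = x[i]*s + b. With a fused multiply-add
// the argument order matters: fmaf(a, b, c) == a*b + c, so the addend must
// go in the last slot. Scalar sketch of the vector operation.
void vec_mad1_f32(int n, float* y, const float* x, float s, float b) {
    for (int i = 0; i < n; ++i) {
        y[i] = std::fmaf(x[i], s, b);  // x[i]*s + b, rounded once
    }
}
```

Swapping `s` and `b` in the `fmaf` call would instead compute `x[i]*b + s`, which is the kind of mistake the new SIMD test catches.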
Adds additional percentile data to the output of `llama-perplexity --kl-divergence`:
- Added the 95th percentile (mirroring the existing 5th percentile)
- Added the 0.1 percentile (mirroring the existing 99.9 percentile)
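For context, a tail percentile like the ones added above can be computed with sort-and-interpolate. This is a minimal sketch using linear interpolation between ranks, not the exact llama-perplexity implementation:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Percentile of a sample via linear interpolation between the two nearest
// order statistics. Takes the data by value so the caller's copy stays
// unsorted. Illustrative sketch only.
float percentile(std::vector<float> v, float pct) {
    std::sort(v.begin(), v.end());
    const float rank = pct / 100.0f * (float)(v.size() - 1);
    const size_t lo = (size_t) std::floor(rank);
    const size_t hi = (size_t) std::ceil(rank);
    const float frac = rank - (float) lo;
    return v[lo] + frac * (v[hi] - v[lo]);
}
```

With this convention the 95th and 5th percentiles are mirror images of each other, matching the pairing described in the change.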
* tools/main: llama-cli: prevent spurious assistant token (ggml-org#13402)
  During prompt ingestion, prompt tokens are accepted into the sampler history (for repetition penalties). The conversation-mode path then appended `common_sampler_last(smpl)` to `assistant_ss` before any new token was sampled. At that point, "last" was a prompt-side token (e.g., an input prefix), so the assistant chat message began with an extra piece.
  Fix: append to `assistant_ss` only for a newly sampled (non-EOG) token. This affects only chat message assembly (`assistant_ss` / `chat_msgs` / `common_chat_format_single`); terminal stdout is unchanged. Sampling order/logits are unchanged.
  Fixes ggml-org#13402.
  Signed-off-by: Vinkal Chudgar <vinkal.chudgar@gmail.com>
* Update tools/main/main.cpp
  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* tools/main: remove outdated comment
  Signed-off-by: Vinkal Chudgar <vinkal.chudgar@gmail.com>
---------
Signed-off-by: Vinkal Chudgar <vinkal.chudgar@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
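The shape of that fix can be illustrated with a stripped-down model of the assembly loop. `Token` and `assemble_assistant` here are simplified stand-ins, not the llama.cpp types:

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Token {
    std::string piece;  // detokenized text of this sampled token
    bool eog;           // end-of-generation marker
};

// Assemble the assistant chat message from *sampled* tokens only. Prompt
// tokens are accepted into sampler history for repetition penalties but must
// never leak into the message, which is what the fix above guarantees.
std::string assemble_assistant(const std::vector<Token>& sampled) {
    std::string assistant_ss;
    for (const Token& t : sampled) {
        if (t.eog) break;          // EOG token ends the turn, is not appended
        assistant_ss += t.piece;   // only newly sampled tokens join the reply
    }
    return assistant_ss;
}
```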
…witching to nullish coalescing for field values and default placeholders (ggml-org#16312)
* fix: Always show conversation item actions
* feat: Improve Alert Dialog and Dialog mobile UI
* feat: Add settings reset to default confirmation
* fix: Close Edit dialog on save
* chore: update webui build output
* webui: implement proper z-index system and scroll management
  - Add CSS variable for centralized z-index control
  - Fix dropdown positioning with Settings dialog conflicts
  - Prevent external scroll interference with proper event handling
  - Clean up hardcoded z-index values for maintainable architecture
* webui: ensured the settings dialog enforces dynamic viewport height on mobile while retaining existing desktop sizing overrides
* feat: Use `dvh` instead of computed px height for dialogs max height on mobile
* chore: update webui build output
* feat: Improve Settings fields UI
* chore: update webui build output
* chore: update webui build output
---------
Co-authored-by: Pascal <admin@serveurperso.com>
* check cuda argsort limits and add test
* add metal check
…rary fails (ggml-org#16172) This PR adds additional information to the error message when loading a backend library via ld_load_library() fails. This helps spot why the backend library did not load (missing library, missing dependency, unresolved symbol, etc.).
This commit removes the `-dev` suffix from the version string in CMakeLists.txt and the release script. The version will now be formatted simply as `MAJOR.MINOR.PATCH`.
* ggml : Fix MKL detection by quoting BLAS_INCLUDE_DIRS (whisper/3426)
* sync : whisper.cpp
…l-org#16655)
* add missing norm topk bias
* use clamping instead, update number and add comment
… to support large batch (ggml-org#16744)
* fix k_compute_batched_ptrs
* add backend ops test
* Update ggml/src/ggml-cuda/ggml-cuda.cu
  Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* reduce the batch size
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
…guous (ggml-org#16789)
* use fast copy when src and dst are contiguous and same shape
* use int64_t ne and ignore shape
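The fast path named above is simple to picture: when source and destination are contiguous and share a shape, an element-wise copy collapses to a single `memcpy`. A sketch with plain vectors standing in for ggml tensors:

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Copy src into dst. The contiguous-same-shape flag stands in for the checks
// the real change performs on the tensors' ne/nb layout; when it holds, one
// bulk memcpy replaces the per-element loop. Illustrative sketch only.
void tensor_copy(const std::vector<float>& src, std::vector<float>& dst,
                 bool contiguous_same_shape) {
    if (contiguous_same_shape) {
        std::memcpy(dst.data(), src.data(), src.size() * sizeof(float));
        return;
    }
    for (size_t i = 0; i < src.size(); ++i) {  // generic strided fallback
        dst[i] = src[i];
    }
}
```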
* SYCL repeat_back v1 — add core op + switch case
* Implement repeat_back SYCL operation and minor fixes
* Update ggml/src/ggml-sycl/repeat_back.cpp
  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update ggml/src/ggml-sycl/repeat_back.hpp
  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update ggml/src/ggml-sycl/ggml-sycl.cpp
  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* sycl: add ROLL operation support
  - Implement ggml_sycl_roll function for F32 tensors
  - Add multi-axis roll operation with SYCL kernel
  - Support all 4 tensor dimensions with proper shift normalization
  - Add roll.cpp and roll.hpp to SYCL backend
  - Update backend dispatch and supports_op for GGML_OP_ROLL
  - Tests: 17662/17662 pass with identical CPU reference results
* fix: remove trailing whitespace from roll.cpp
  - Fix EditorConfig violations in ggml/src/ggml-sycl/roll.cpp
  - Remove trailing spaces from lines 6, 11, 28, 47, 58, 60
* ci: retrigger
* sycl: remove wait() calls from ROLL operation
* fix: editorconfig — LF endings + final newline for roll.hpp
---------
Co-authored-by: tamarPal <tamarPal@example.com>
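The "proper shift normalization" mentioned above means mapping a shift of any sign or magnitude to an equivalent offset in [0, n). A 1-D sketch of the indexing a roll kernel uses (illustrative, not the SYCL code):

```cpp
#include <cassert>
#include <vector>

// Roll a 1-D array by `shift` positions. Negative or oversized shifts are
// normalized into [0, n) with ((shift % n) + n) % n, the same normalization
// a multi-axis roll applies per dimension.
std::vector<int> roll1d(const std::vector<int>& src, int shift) {
    const int n = (int) src.size();
    const int s = ((shift % n) + n) % n;  // normalize negative/large shifts
    std::vector<int> dst(n);
    for (int i = 0; i < n; ++i) {
        dst[(i + s) % n] = src[i];        // element i moves forward by s
    }
    return dst;
}
```

The 4-D case repeats this per axis, combining the normalized per-axis offsets into the destination index.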
Tested all models on POSIX after adding the PR #73 changes. All models are passing. The test results log is attached.
dineshReddy6381
approved these changes
Oct 29, 2025
Approved
akapoor3518
approved these changes
Oct 29, 2025
lgtm
…d openssl. This change brings in the latest llama.cpp and adds the following dependencies:
1. OpenSSL
2. libcurl

The Arm toolchain has also been updated with the corresponding libraries in the /proj/rel/sw area (the full `ls` listing of /proj/rel/sw/arm-gnu-toolchain-14.2.rel1-x86_64-aarch64-none-linux-gnu/lib is reproduced in the PR description below).
918fe1d to b4f5466
Update llama.cpp to the latest version, and add libcurl, libssl-dev, and other dependent libraries to support the merge.
[atrivedi@ws01 workspace]$ ls /proj/rel/sw/arm-gnu-toolchain-14.2.rel1-x86_64-aarch64-none-linux-gnu/lib -lt
total 41840
-rw-r--r-- 1 atrivedi tsiusers 730992 Oct 29 06:38 libzstd.so.1.4.8
-rw-r--r-- 1 atrivedi tsiusers 730992 Oct 29 06:38 libzstd.so.1
-rwxr-xr-x 1 atrivedi tsiusers 96248 Oct 29 06:38 libz.so.1.3.1
-rwxr-xr-x 1 atrivedi tsiusers 96248 Oct 29 06:38 libz.so.1
-rwxr-xr-x 1 atrivedi tsiusers 96248 Oct 29 06:38 libz.so
-rw-r--r-- 1 atrivedi tsiusers 128580 Oct 29 06:38 libz.a
-rwxr-xr-x 1 atrivedi tsiusers 2033616 Oct 29 06:38 libunistring.so.5.2.0
-rwxr-xr-x 1 atrivedi tsiusers 2033616 Oct 29 06:38 libunistring.so.5
-rwxr-xr-x 1 atrivedi tsiusers 2033616 Oct 29 06:38 libunistring.so
-rwxr-xr-x 1 atrivedi tsiusers 1196040 Oct 29 06:38 libssl.so.3
-rw-r--r-- 1 atrivedi tsiusers 132968 Oct 29 06:38 libidn2.so.0.4.0
-rw-r--r-- 1 atrivedi tsiusers 132968 Oct 29 06:38 libidn2.so.0
-rwxr-xr-x 1 atrivedi tsiusers 1195448 Oct 29 06:38 libcurl.so.4.7.0
-rwxr-xr-x 1 atrivedi tsiusers 1195448 Oct 29 06:38 libcurl.so.4.6.0
-rwxr-xr-x 1 atrivedi tsiusers 1195448 Oct 29 06:38 libcurl.so.4.5.0
-rwxr-xr-x 1 atrivedi tsiusers 1195448 Oct 29 06:38 libcurl.so.4.4.0
-rwxr-xr-x 1 atrivedi tsiusers 1195448 Oct 29 06:38 libcurl.so.4.3.0
-rwxr-xr-x 1 atrivedi tsiusers 1195448 Oct 29 06:38 libcurl.so.4.2.0
-rwxr-xr-x 1 atrivedi tsiusers 1195448 Oct 29 06:38 libcurl.so.4.1.0
-rwxr-xr-x 1 atrivedi tsiusers 1195448 Oct 29 06:38 libcurl.so.4.0.0
-rwxr-xr-x 1 atrivedi tsiusers 1195448 Oct 29 06:38 libcurl.so.3
-rwxr-xr-x 1 atrivedi tsiusers 666912 Oct 29 06:38 libcurl.so
-rw-r--r-- 1 atrivedi tsiusers 1066 Oct 29 06:38 libcurl.la
-rwxr-xr-x 1 atrivedi tsiusers 1195448 Oct 29 06:38 libcurl-compat.so.4.8.0
-rwxr-xr-x 1 atrivedi tsiusers 1195448 Oct 29 06:38 libcurl-compat.so.4.7.0
-rwxr-xr-x 1 atrivedi tsiusers 1195448 Oct 29 06:38 libcurl-compat.so.4.6.0
-rwxr-xr-x 1 atrivedi tsiusers 1195448 Oct 29 06:38 libcurl-compat.so.4.5.0
-rwxr-xr-x 1 atrivedi tsiusers 1195448 Oct 29 06:38 libcurl-compat.so.4.4.0
-rwxr-xr-x 1 atrivedi tsiusers 1195448 Oct 29 06:38 libcurl-compat.so.4.3.0
-rwxr-xr-x 1 atrivedi tsiusers 1195448 Oct 29 06:38 libcurl-compat.so.4.2.0
-rwxr-xr-x 1 atrivedi tsiusers 1195448 Oct 29 06:38 libcurl-compat.so.4.1.0
-rwxr-xr-x 1 atrivedi tsiusers 1195448 Oct 29 06:38 libcurl-compat.so.4.0.0
-rwxr-xr-x 1 atrivedi tsiusers 1195448 Oct 29 06:38 libcurl-compat.so.3
-rw-r--r-- 1 atrivedi tsiusers 776948 Oct 29 06:38 libcurl.a
-rwxr-xr-x 1 atrivedi tsiusers 6302760 Oct 29 06:38 libcrypto.so.3
-rw-r--r-- 1 atrivedi tsiusers 2612824 Oct 29 06:38 libcrypto.so.1.1
Test results on FPGA3
root@agilex7_dk_si_agf014eb:/usr/bin/tsi/v0.1.1.tsv38_10_12_2025/bin#
root@agilex7_dk_si_agf014eb:/usr/bin/tsi/v0.1.1.tsv38_10_12_2025/bin# ./run_llama_cli.sh
is Luna.
llama_perf_sampler_print: sampling time = 117.57 ms / 11 runs ( 10.69 ms per token, 93.56 tokens per second)
llama_perf_context_print: load time = 54557.03 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 82307.61 ms / 4 runs (20576.90 ms per token, 0.05 tokens per second)
llama_perf_context_print: total time = 92882.72 ms / 5 tokens
=== GGML Perf Summary ===
Op Target Runs TSI_KERNEL-RUN Total us Avg us
ADD OPU 440 440 1306380 2969.05
MUL OPU 450 450 667893 1484.21
RMS_NORM OPU 450 450 1092385 2427.52
MUL_MAT CPU 7848 0 588990168 75049.72
CONT CPU 1301 0 1511420 1161.74
RESHAPE CPU 1118 0 20732 18.54
VIEW CPU 1857 0 2588 1.39
PERMUTE CPU 1315 0 2365 1.80
TRANSPOSE CPU 369 0 791 2.14
GET_ROWS CPU 75 0 20576 274.35
SET_ROWS CPU 1377 0 31575 22.93
SOFT_MAX CPU 514 0 655688 1275.66
ROPE CPU 1370 0 104577 76.33
GLU OPU 220 220 802890 3649.50
OPU Profiling Results:
Calls Total(ms) T/call Self(ms) Function
[Thread] tsi::runtime::TsavRT::awaitCommandListCompletion (cumulative over all threads)
624 518.6700 0.8312 0.0000 [5.46e-01%] [Thread] tsi::runtime::TsavRT::awaitCommandListCompletion
624 91913.1910 147.2968 91913.1910 └─ [96.69%] TXE 0 Idle
88 79.3888 0.9021 79.3888 └─ [8.35e-02%] [ txe_swiglu ]
180 68.6057 0.3811 68.6057 └─ [7.22e-02%] [ txe_rms_norm ]
180 54.7632 0.3042 54.7632 └─ [5.76e-02%] [ txe_mult ]
176 50.1866 0.2852 50.1866 └─ [5.28e-02%] [ txe_add ]
[Thread] OPU (cumulative over all threads)
[Thread] tsi::runtime::TsavRT::finalizeCommandList (cumulative over all threads)
624 426.2960 0.6832 406.4210 [4.48e-01%] [Thread] tsi::runtime::TsavRT::finalizeCommandList
624 19.8750 0.0319 19.8750 └─ [2.09e-02%] tsi::runtime::executeWithTimeout
[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)
624 694.6400 1.1132 25.9990 [7.31e-01%] [Thread] tsi::runtime::TsavRT::processResponses
624 668.6410 1.0715 668.6410 └─ [7.03e-01%] tsi::runtime::executeWithTimeout
[Thread] tsi::runtime::TsavRTFPGA::finalize (cumulative over all threads)
[Thread] tsi::runtime::TsavRT::allocate (cumulative over all threads)
626 56.3350 0.0900 56.3350 [5.93e-02%] [Thread] tsi::runtime::TsavRT::allocate
[Thread] tsi::runtime::TsavRTFPGA::loadBlob (cumulative over all threads)
624 249.0610 0.3991 249.0610 [2.62e-01%] [Thread] tsi::runtime::TsavRTFPGA::loadBlob
[Thread] tsi::runtime::TsavRT::addCommandToList (cumulative over all threads)
624 47.7800 0.0766 47.7800 [5.03e-02%] [Thread] tsi::runtime::TsavRT::addCommandToList
[Thread] tsi::runtime::TsavRTFPGA::unloadBlob (cumulative over all threads)
624 47.7840 0.0766 47.7840 [5.03e-02%] [Thread] tsi::runtime::TsavRTFPGA::unloadBlob
[Thread] tsi::runtime::TsavRT::deallocate (cumulative over all threads)
624 8.1720 0.0131 8.1720 [8.60e-03%] [Thread] tsi::runtime::TsavRT::deallocate
========================================================================================================================
Counter Metrics:
Metric Min Max Avg
Queue_0_Occupancy 0.0000 1.0000 0.7891
root@agilex7_dk_si_agf014eb:/usr/bin/tsi/v0.1.1.tsv38_10_12_2025/bin#
POSIX results
[atrivedi@ws01 llama.cpp]$ build-posix/bin/llama-cli -p "My cat’s name" -m /proj/rel/sw/ggml/models/Tiny-Llama-v0.3-FP32-1.1B-F32.gguf --device none -c 12288 --temp 0.0 --n-predict 10 --repeat-penalty 1.5 -b 1024 --top-k 50 --top-p 0.9 --repeat-last-n 5 --no-warmup --no-display-prompt --single-turn
is “Sparky” and I like to
llama_perf_sampler_print: sampling time = 23.38 ms / 16 runs ( 1.46 ms per token, 684.29 tokens per second)
llama_perf_context_print: load time = 11765.44 ms
llama_perf_context_print: prompt eval time = 1875.26 ms / 6 tokens ( 312.54 ms per token, 3.20 tokens per second)
llama_perf_context_print: eval time = 2299.64 ms / 9 runs ( 255.52 ms per token, 3.91 tokens per second)
llama_perf_context_print: total time = 14091.36 ms / 15 tokens
=== GGML Perf Summary ===
Op Target Runs TSI_KERNEL-RUN Total us Avg us
ADD CPU 7363 0 50464 6.85
MUL CPU 6618 0 60838 9.19
RMS_NORM CPU 6477 0 24035 3.71
MUL_MAT CPU 28386 0 49873174 1756.96
CPY CPU 177 0 24627 139.14
RESHAPE CPU 15525 0 6318 0.41
VIEW CPU 16600 0 2046 0.12
PERMUTE CPU 10211 0 1081 0.11
GET_ROWS CPU 413 0 15115 36.60
SET_ROWS CPU 6717 0 9140 1.36
ROPE CPU 7876 0 33477 4.25
FLASH_ATTN_EXT CPU 4043 0 265055 65.56
GLU CPU 3693 0 101761 27.56
[2025-10-29 08:35:16.416458] 3245264:3245264 [warning] TsavRT-0.4.5 TsavRT.h:165: TsavRT destructor reached without finalize()
pure virtual method called
No symbol table is loaded. Use the "file" command.
[New LWP 3245266]
[New LWP 3245267]
[New LWP 3245273]
[New LWP 3245274]
[New LWP 3245275]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
warning: File "/proj/local/gcc-13.3.0/lib64/libstdc++.so.6.0.32-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
To enable execution of this file add
add-auto-load-safe-path /proj/local/gcc-13.3.0/lib64/libstdc++.so.6.0.32-gdb.py
line to your configuration file "/users/atrivedi/.gdbinit".
To completely disable this security protection add
set auto-load safe-path /
line to your configuration file "/users/atrivedi/.gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual. E.g., run from the shell:
info "(gdb)Auto-loading safe path"
0x00007f7ac2db4312 in waitpid () from /lib64/libpthread.so.0
No symbol "frame" in current context.
[Inferior 1 (process 3245264) detached]
terminate called without an active exception
Aborted (core dumped)
[atrivedi@ws01 llama.cpp]$
[atrivedi@ws01 llama.cpp]$ build-posix/bin/llama-cli -p "My cat’s name" -m /proj/rel/sw/ggml/models/Tiny-Llama-v0.3-FP32-1.1B-F32.gguf --device tSavorite -c 12288 --temp 0.0 --n-predict 10 --repeat-penalty 1.5 -b 1024 --top-k 50 --top-p 0.9 --repeat-last-n 5 --no-warmup --no-display-prompt --single-turn
is “Sparky” and I like
llama_perf_sampler_print: sampling time = 26.88 ms / 16 runs ( 1.68 ms per token, 595.22 tokens per second)
llama_perf_context_print: load time = 6260.56 ms
llama_perf_context_print: prompt eval time = 4963.25 ms / 6 tokens ( 827.21 ms per token, 1.21 tokens per second)
llama_perf_context_print: eval time = 7965.88 ms / 9 runs ( 885.10 ms per token, 1.13 tokens per second)
llama_perf_context_print: total time = 14256.38 ms / 15 tokens
=== GGML Perf Summary ===
Op Target Runs TSI_KERNEL-RUN Total us Avg us
ADD OPU 2024 2234 3191330 1576.74
MUL OPU 2070 2285 1301083 628.54
RMS_NORM OPU 2070 2070 1572910 759.86
MUL_MAT CPU 36468 0 53550692 1468.43
CONT CPU 7567 0 425810 56.27
RESHAPE CPU 11611 0 6708 0.58
VIEW CPU 17644 0 2409 0.14
PERMUTE CPU 13935 0 3508 0.25
TRANSPOSE CPU 3341 0 827 0.25
GET_ROWS CPU 358 0 3702 10.34
SET_ROWS CPU 7326 0 7122 0.97
SOFT_MAX OPU 1012 35904 21545064 21289.59
ROPE CPU 7777 0 42279 5.44
GLU OPU 1012 1117 1566043 1547.47
to
OPU Profiling Results:
Calls Total(ms) T/call Self(ms) Function
[Thread] tsi::runtime::TsavRT::finalize (cumulative over all threads)
[Thread] tsi::runtime::TsavRTPosix::loadBlob (cumulative over all threads)
12650 2640.6690 0.2087 384.3910 [16.14%] [Thread] tsi::runtime::TsavRTPosix::loadBlob
25300 2252.0890 0.0890 2252.0890 └─ [13.76%] tsi::runtime::executeWithTimeout
12650 4.1890 3.31e-04 4.1890 └─ [2.56e-02%] LOAD_BLOB Command Execution
12650 0.0000 0.0000 0.0000 └─ [0.00e+00%] Command{command=2 (LOAD_BLOB), blob_args=[2148009600[0x800...
12650 0.0000 0.0000 0.0000 └─ [0.00e+00%] TXE 0 Idle
[Thread] tsi::runtime::TsavRTPosix::unloadBlob (cumulative over all threads)
12650 2232.9710 0.1765 439.3630 [13.65%] [Thread] tsi::runtime::TsavRTPosix::unloadBlob
25300 1788.2030 0.0707 1788.2030 └─ [10.93%] tsi::runtime::executeWithTimeout
12650 5.4050 4.27e-04 5.4050 └─ [3.30e-02%] UNLOAD_BLOB Command Execution
12650 0.0000 0.0000 0.0000 └─ [0.00e+00%] Command{command=3 (UNLOAD_BLOB), blob_args=[2148009600[0x8...
12650 0.0000 0.0000 0.0000 └─ [0.00e+00%] TXE 0 Idle
[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)
12652 2385.7820 0.1886 106.2750 [14.58%] [Thread] tsi::runtime::TsavRT::processResponses
12652 2279.5070 0.1802 2279.5070 └─ [13.93%] tsi::runtime::executeWithTimeout
[Thread] OPU (cumulative over all threads)
[Thread] tsi::runtime::TsavRT::finalizeCommandList (cumulative over all threads)
12650 203.3630 0.0161 184.7700 [ 1.24%] [Thread] tsi::runtime::TsavRT::finalizeCommandList
12650 18.5930 0.0015 18.5930 └─ [1.14e-01%] tsi::runtime::executeWithTimeout
[Thread] tsi::runtime::TsavRT::allocate (cumulative over all threads)
12653 19.6360 0.0016 19.6360 [1.20e-01%] [Thread] tsi::runtime::TsavRT::allocate
[Thread] tsi::runtime::TsavRT::addCommandToList (cumulative over all threads)
12650 50.9880 0.0040 50.9880 [3.12e-01%] [Thread] tsi::runtime::TsavRT::addCommandToList
[Thread] tsi::runtime::TsavRT::awaitCommandListCompletion (cumulative over all threads)
12650 2804.0340 0.2217 2804.0340 [17.14%] [Thread] tsi::runtime::TsavRT::awaitCommandListCompletion
[Thread] tsi::runtime::TsavRT::deallocate (cumulative over all threads)
12650 16.5870 0.0013 16.5870 [1.01e-01%] [Thread] tsi::runtime::TsavRT::deallocate
========================================================================================================================
Counter Metrics:
Metric Min Max Avg
Queue_0_Occupancy 0.0000 1.0000 0.9998
[atrivedi@ws01 llama.cpp]$