forked from ggml-org/llama.cpp
Fir 1035: Update llama.cpp to latest and add all the new dependencies to the toolchain #74
Merged
Conversation
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
…vices (ggml-org#16156)
* Throw system error on old Vulkan driver rather than SIGABRT
* Optionally handle any potential error in vulkan init
* CUDA: refactor and deduplicate vector FA kernels
…gml-org#16277)
* CUDA: mul_mat_id for mmf for bs <= 64 for f16 and bs <= 32 for f32
  This commit adds mul_mat_id support for ncols_dst >= 16. It does this by packing ncols_dst tiles into the blockDim.y. My tests on an RTX 3090 show that this is faster than the cuBLAS fallback for f16 up to bs=64, and for f32 up to bs=32.
* Review: refactor if statement
…l-org#16224)
* don't use VULKAN_HPP_DEFAULT_DISPATCH_LOADER_DYNAMIC_STORAGE which can cause conflicts if application or other libraries do the same
…6160) The "Clamp" spec constant is already based on whether KV is a multiple of Bc, so use that to control whether bounds checking is performed. Add bounds checking to the scalar and coopmat1 paths. Coopmat2 didn't need any changes (the K/V tensors are already optionally clamped, nothing else needed to be changed).
* vulkan: handle mat_mul with A matrix > 4GB
  This change splits mat_mul operations with huge A matrix into chunks in the M dimension. This works well for stable-diffusion use cases where the im2col matrix has very large M.
  Fix the order of setting the stride in mul_mm_cm2 - setting the dimension clobbers the stride, so stride should be set after.
* build fixes
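The M-dimension chunking described above can be sketched in plain C++. The function name, dense row-major layout, and `max_rows` parameter below are illustrative assumptions, not the actual Vulkan shader path:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Multiply A (M x K) by B (K x N) into C (M x N), splitting the M dimension
// into chunks so that no single dispatch sees more than max_rows rows of A.
// Sketch of the chunking strategy only; the real work happens on the GPU.
void matmul_chunked(const std::vector<float>& A, const std::vector<float>& B,
                    std::vector<float>& C, int64_t M, int64_t K, int64_t N,
                    int64_t max_rows) {
    for (int64_t m0 = 0; m0 < M; m0 += max_rows) {
        const int64_t rows = std::min(max_rows, M - m0);  // last chunk may be short
        // Each chunk is an independent (rows x K) * (K x N) multiply.
        for (int64_t i = 0; i < rows; ++i) {
            for (int64_t j = 0; j < N; ++j) {
                float acc = 0.0f;
                for (int64_t k = 0; k < K; ++k) {
                    acc += A[(m0 + i) * K + k] * B[k * N + j];
                }
                C[(m0 + i) * N + j] = acc;
            }
        }
    }
}
```

Because each chunk only reads its own row range of A, the result is identical to the unchunked multiply while keeping every dispatch under the size limit.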
* metal : fuse non-sequential nodes
* cont : add comment
* cont : simplify bounds checks
* metal : support mul_mm with src1->type == GGML_TYPE_F16
* metal : support mul_mm_id with src1->type == GGML_TYPE_F16 [no ci]
* metal : mul_mm support ne00 % 32 != 0
* metal : support mul_mm_id with ne00 % 32 != 0
* cont : remove unnecessary unrolls
* cont : simplify data loading
* metal : optimize mul_mm when output bounds checks are not needed
* vulkan: 64-bit im2col
  Add variants of the im2col shaders that use buffer_device_address/buffer_reference, and use 64-bit address calculations. This is needed for large convolutions used in stable-diffusion.cpp.
* fix validation error for large im2col
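The motivation for 64-bit address calculations can be shown with a tiny sketch: a flat im2col offset computed in 32-bit arithmetic wraps once the element count passes 2^32, while the same arithmetic in 64 bits does not. The helpers below are illustrative, not the shader code:

```cpp
#include <cassert>
#include <cstdint>

// Flat offset into a row-major buffer. Doing the multiply in int64_t avoids
// the silent wraparound a 32-bit computation would produce for large tensors.
int64_t flat_index64(int64_t row, int64_t col, int64_t row_stride) {
    return row * row_stride + col;
}

// Same computation, deliberately performed in unsigned 32-bit arithmetic,
// to show the wraparound (modulo 2^32) for large convolutions.
uint32_t flat_index32(uint32_t row, uint32_t col, uint32_t row_stride) {
    return row * row_stride + col;
}
```

For a 100000 x 100000 im2col matrix the 32-bit index silently wraps, which is exactly the failure mode the 64-bit shader variants avoid.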
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
…ired (ggml-org#16264)
* common : fix reasoning before forced tool call via tool_choice = required
* common : improve reasoning and commentary handling when tool_choice is required
(cherry picked from commit c746984)
---------
Co-authored-by: Alde Rojas <hello@alde.dev>
…-org#16307)
* fix GGML_F32_VEC_FMA argument order in ggml_vec_mad1_f32
* add test that fails on simd
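The bug class fixed above is easy to reproduce: `fmaf(a, b, c)` computes `a*b + c`, so placing the addend in a multiplicand position changes the result. A scalar sketch of the mad1 operation (the name loosely mirrors the ggml helper; this is not the SIMD code):

```cpp
#include <cassert>
#include <cmath>

// ggml_vec_mad1_f32 computes y[i] = x[i]*s + b. With a fused multiply-add
// the argument order matters: fmaf(a, b, c) == a*b + c, so the addend must
// go in the last slot. Scalar sketch of the vector operation.
void vec_mad1_f32(int n, float* y, const float* x, float s, float b) {
    for (int i = 0; i < n; ++i) {
        y[i] = std::fmaf(x[i], s, b);  // x[i]*s + b, rounded once
    }
}
```

Swapping `s` and `b` in the `fmaf` call would instead compute `x[i]*b + s`, which is the kind of mistake the new SIMD test catches.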
Adds additional percentile data to the output of `llama-perplexity --kl-divergence`:
- Added the 95th percentile (mirroring the existing 5th percentile)
- Added the 0.1 percentile (mirroring the existing 99.9 percentile)
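For context, a tail percentile like the ones added above can be computed with sort-and-interpolate. This is a minimal sketch using linear interpolation between ranks, not the exact llama-perplexity implementation:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Percentile of a sample via linear interpolation between the two nearest
// order statistics. Takes the data by value so the caller's copy stays
// unsorted. Illustrative sketch only.
float percentile(std::vector<float> v, float pct) {
    std::sort(v.begin(), v.end());
    const float rank = pct / 100.0f * (float)(v.size() - 1);
    const size_t lo = (size_t) std::floor(rank);
    const size_t hi = (size_t) std::ceil(rank);
    const float frac = rank - (float) lo;
    return v[lo] + frac * (v[hi] - v[lo]);
}
```

With this convention the 95th and 5th percentiles are mirror images of each other, matching the pairing described in the change.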
* tools/main: llama-cli: prevent spurious assistant token (ggml-org#13402)
  During prompt ingestion, prompt tokens are accepted into the sampler history (for repetition penalties). The conversation-mode path then appended `common_sampler_last(smpl)` to `assistant_ss` before any new token was sampled. At that point, "last" was a prompt-side token (e.g., an input prefix), so the assistant chat message began with an extra piece.
  Fix: append to `assistant_ss` only for a newly sampled (non-EOG) token. This affects only chat message assembly (`assistant_ss` / `chat_msgs` / `common_chat_format_single`); terminal stdout is unchanged. Sampling order/logits are unchanged.
  Fixes ggml-org#13402.
  Signed-off-by: Vinkal Chudgar <vinkal.chudgar@gmail.com>
* Update tools/main/main.cpp
  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* tools/main: remove outdated comment
  Signed-off-by: Vinkal Chudgar <vinkal.chudgar@gmail.com>
---------
Signed-off-by: Vinkal Chudgar <vinkal.chudgar@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
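The shape of that fix can be illustrated with a stripped-down model of the assembly loop. `Token` and `assemble_assistant` here are simplified stand-ins, not the llama.cpp types:

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Token {
    std::string piece;  // detokenized text of this sampled token
    bool eog;           // end-of-generation marker
};

// Assemble the assistant chat message from *sampled* tokens only. Prompt
// tokens are accepted into sampler history for repetition penalties but must
// never leak into the message, which is what the fix above guarantees.
std::string assemble_assistant(const std::vector<Token>& sampled) {
    std::string assistant_ss;
    for (const Token& t : sampled) {
        if (t.eog) break;          // EOG token ends the turn, is not appended
        assistant_ss += t.piece;   // only newly sampled tokens join the reply
    }
    return assistant_ss;
}
```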
…witching to nullish coalescing for field values and default placeholders (ggml-org#16312)
* fix: Always show conversation item actions
* feat: Improve Alert Dialog and Dialog mobile UI
* feat: Add settings reset to default confirmation
* fix: Close Edit dialog on save
* chore: update webui build output
* webui: implement proper z-index system and scroll management
  - Add CSS variable for centralized z-index control
  - Fix dropdown positioning with Settings dialog conflicts
  - Prevent external scroll interference with proper event handling
  - Clean up hardcoded z-index values for maintainable architecture
* webui: ensured the settings dialog enforces dynamic viewport height on mobile while retaining existing desktop sizing overrides
* feat: Use `dvh` instead of computed px height for dialogs max height on mobile
* chore: update webui build output
* feat: Improve Settings fields UI
* chore: update webui build output
* chore: update webui build output
---------
Co-authored-by: Pascal <admin@serveurperso.com>
* check cuda argsort limits and add test
* add metal check
…rary fails (ggml-org#16172) This PR adds additional information to the error message when loading a backend library via ld_load_library() fails. This helps spot why the backend library did not load (missing library, missing dependency, unresolved symbol, etc.).
This commit removes the `-dev` suffix from the version string in CMakeLists.txt and the release script. The version will now be formatted simply as `MAJOR.MINOR.PATCH`.
* ggml : Fix MKL detection by quoting BLAS_INCLUDE_DIRS (whisper/3426)
* sync : whisper.cpp
…l-org#16655)
* add missing norm topk bias
* use clamping instead, update number and add comment
… to support large batch (ggml-org#16744)
* fix k_compute_batched_ptrs
* add backend ops test
* Update ggml/src/ggml-cuda/ggml-cuda.cu
  Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* reduce the batch size
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
…guous (ggml-org#16789)
* use fast copy when src and dst are contiguous and same shape
* use int64_t ne and ignore shape
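The fast path named above is simple to picture: when source and destination are contiguous and share a shape, an element-wise copy collapses to a single `memcpy`. A sketch with plain vectors standing in for ggml tensors:

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Copy src into dst. The contiguous-same-shape flag stands in for the checks
// the real change performs on the tensors' ne/nb layout; when it holds, one
// bulk memcpy replaces the per-element loop. Illustrative sketch only.
void tensor_copy(const std::vector<float>& src, std::vector<float>& dst,
                 bool contiguous_same_shape) {
    if (contiguous_same_shape) {
        std::memcpy(dst.data(), src.data(), src.size() * sizeof(float));
        return;
    }
    for (size_t i = 0; i < src.size(); ++i) {  // generic strided fallback
        dst[i] = src[i];
    }
}
```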
* SYCL repeat_back v1 — add core op + switch case
* Implement repeat_back SYCL operation and minor fixes
* Update ggml/src/ggml-sycl/repeat_back.cpp
  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update ggml/src/ggml-sycl/repeat_back.hpp
  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update ggml/src/ggml-sycl/ggml-sycl.cpp
  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* sycl: add ROLL operation support
  - Implement ggml_sycl_roll function for F32 tensors
  - Add multi-axis roll operation with SYCL kernel
  - Support all 4 tensor dimensions with proper shift normalization
  - Add roll.cpp and roll.hpp to SYCL backend
  - Update backend dispatch and supports_op for GGML_OP_ROLL
  - Tests: 17662/17662 pass with identical CPU reference results
* fix: remove trailing whitespace from roll.cpp
  - Fix EditorConfig violations in ggml/src/ggml-sycl/roll.cpp
  - Remove trailing spaces from lines 6, 11, 28, 47, 58, 60
* ci: retrigger
* sycl: remove wait() calls from ROLL operation
* fix: editorconfig — LF endings + final newline for roll.hpp
---------
Co-authored-by: tamarPal <tamarPal@example.com>
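The "proper shift normalization" mentioned above means mapping a shift of any sign or magnitude to an equivalent offset in [0, n). A 1-D sketch of the indexing a roll kernel uses (illustrative, not the SYCL code):

```cpp
#include <cassert>
#include <vector>

// Roll a 1-D array by `shift` positions. Negative or oversized shifts are
// normalized into [0, n) with ((shift % n) + n) % n, the same normalization
// a multi-axis roll applies per dimension.
std::vector<int> roll1d(const std::vector<int>& src, int shift) {
    const int n = (int) src.size();
    const int s = ((shift % n) + n) % n;  // normalize negative/large shifts
    std::vector<int> dst(n);
    for (int i = 0; i < n; ++i) {
        dst[(i + s) % n] = src[i];        // element i moves forward by s
    }
    return dst;
}
```

The 4-D case repeats this per axis, combining the normalized per-axis offsets into the destination index.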
Tested all models on POSIX after adding the PR #73 changes. All models are passing. The test results log is attached.
dineshReddy6381
approved these changes
Oct 29, 2025
Approved
akapoor3518
approved these changes
Oct 29, 2025
lgtm
…d openssl. This change brings in the latest llama.cpp and adds the following dependencies:
1. OpenSSL
2. libcurl

The Arm toolchain has also been updated with the corresponding libraries in the /proj/rel/sw area (the full `ls` listing of /proj/rel/sw/arm-gnu-toolchain-14.2.rel1-x86_64-aarch64-none-linux-gnu/lib is reproduced in the PR description below).
918fe1d to b4f5466
Update llama.cpp to the latest version, and add libcurl, libssl-dev, and other dependent libraries to support the merge.
[atrivedi@ws01 workspace]$ ls /proj/rel/sw/arm-gnu-toolchain-14.2.rel1-x86_64-aarch64-none-linux-gnu/lib -lt
total 41840
-rw-r--r-- 1 atrivedi tsiusers 730992 Oct 29 06:38 libzstd.so.1.4.8
-rw-r--r-- 1 atrivedi tsiusers 730992 Oct 29 06:38 libzstd.so.1
-rwxr-xr-x 1 atrivedi tsiusers 96248 Oct 29 06:38 libz.so.1.3.1
-rwxr-xr-x 1 atrivedi tsiusers 96248 Oct 29 06:38 libz.so.1
-rwxr-xr-x 1 atrivedi tsiusers 96248 Oct 29 06:38 libz.so
-rw-r--r-- 1 atrivedi tsiusers 128580 Oct 29 06:38 libz.a
-rwxr-xr-x 1 atrivedi tsiusers 2033616 Oct 29 06:38 libunistring.so.5.2.0
-rwxr-xr-x 1 atrivedi tsiusers 2033616 Oct 29 06:38 libunistring.so.5
-rwxr-xr-x 1 atrivedi tsiusers 2033616 Oct 29 06:38 libunistring.so
-rwxr-xr-x 1 atrivedi tsiusers 1196040 Oct 29 06:38 libssl.so.3
-rw-r--r-- 1 atrivedi tsiusers 132968 Oct 29 06:38 libidn2.so.0.4.0
-rw-r--r-- 1 atrivedi tsiusers 132968 Oct 29 06:38 libidn2.so.0
-rwxr-xr-x 1 atrivedi tsiusers 1195448 Oct 29 06:38 libcurl.so.4.7.0
-rwxr-xr-x 1 atrivedi tsiusers 1195448 Oct 29 06:38 libcurl.so.4.6.0
-rwxr-xr-x 1 atrivedi tsiusers 1195448 Oct 29 06:38 libcurl.so.4.5.0
-rwxr-xr-x 1 atrivedi tsiusers 1195448 Oct 29 06:38 libcurl.so.4.4.0
-rwxr-xr-x 1 atrivedi tsiusers 1195448 Oct 29 06:38 libcurl.so.4.3.0
-rwxr-xr-x 1 atrivedi tsiusers 1195448 Oct 29 06:38 libcurl.so.4.2.0
-rwxr-xr-x 1 atrivedi tsiusers 1195448 Oct 29 06:38 libcurl.so.4.1.0
-rwxr-xr-x 1 atrivedi tsiusers 1195448 Oct 29 06:38 libcurl.so.4.0.0
-rwxr-xr-x 1 atrivedi tsiusers 1195448 Oct 29 06:38 libcurl.so.3
-rwxr-xr-x 1 atrivedi tsiusers 666912 Oct 29 06:38 libcurl.so
-rw-r--r-- 1 atrivedi tsiusers 1066 Oct 29 06:38 libcurl.la
-rwxr-xr-x 1 atrivedi tsiusers 1195448 Oct 29 06:38 libcurl-compat.so.4.8.0
-rwxr-xr-x 1 atrivedi tsiusers 1195448 Oct 29 06:38 libcurl-compat.so.4.7.0
-rwxr-xr-x 1 atrivedi tsiusers 1195448 Oct 29 06:38 libcurl-compat.so.4.6.0
-rwxr-xr-x 1 atrivedi tsiusers 1195448 Oct 29 06:38 libcurl-compat.so.4.5.0
-rwxr-xr-x 1 atrivedi tsiusers 1195448 Oct 29 06:38 libcurl-compat.so.4.4.0
-rwxr-xr-x 1 atrivedi tsiusers 1195448 Oct 29 06:38 libcurl-compat.so.4.3.0
-rwxr-xr-x 1 atrivedi tsiusers 1195448 Oct 29 06:38 libcurl-compat.so.4.2.0
-rwxr-xr-x 1 atrivedi tsiusers 1195448 Oct 29 06:38 libcurl-compat.so.4.1.0
-rwxr-xr-x 1 atrivedi tsiusers 1195448 Oct 29 06:38 libcurl-compat.so.4.0.0
-rwxr-xr-x 1 atrivedi tsiusers 1195448 Oct 29 06:38 libcurl-compat.so.3
-rw-r--r-- 1 atrivedi tsiusers 776948 Oct 29 06:38 libcurl.a
-rwxr-xr-x 1 atrivedi tsiusers 6302760 Oct 29 06:38 libcrypto.so.3
-rw-r--r-- 1 atrivedi tsiusers 2612824 Oct 29 06:38 libcrypto.so.1.1
Test results on FPGA3
root@agilex7_dk_si_agf014eb:/usr/bin/tsi/v0.1.1.tsv38_10_12_2025/bin#
root@agilex7_dk_si_agf014eb:/usr/bin/tsi/v0.1.1.tsv38_10_12_2025/bin# ./run_llama_cli.sh
is Luna.
llama_perf_sampler_print: sampling time = 117.57 ms / 11 runs ( 10.69 ms per token, 93.56 tokens per second)
llama_perf_context_print: load time = 54557.03 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 82307.61 ms / 4 runs (20576.90 ms per token, 0.05 tokens per second)
llama_perf_context_print: total time = 92882.72 ms / 5 tokens
=== GGML Perf Summary ===
Op Target Runs TSI_KERNEL-RUN Total us Avg us
ADD OPU 440 440 1306380 2969.05
MUL OPU 450 450 667893 1484.21
RMS_NORM OPU 450 450 1092385 2427.52
MUL_MAT CPU 7848 0 588990168 75049.72
CONT CPU 1301 0 1511420 1161.74
RESHAPE CPU 1118 0 20732 18.54
VIEW CPU 1857 0 2588 1.39
PERMUTE CPU 1315 0 2365 1.80
TRANSPOSE CPU 369 0 791 2.14
GET_ROWS CPU 75 0 20576 274.35
SET_ROWS CPU 1377 0 31575 22.93
SOFT_MAX CPU 514 0 655688 1275.66
ROPE CPU 1370 0 104577 76.33
GLU OPU 220 220 802890 3649.50
OPU Profiling Results:
Calls Total(ms) T/call Self(ms) Function
[Thread] tsi::runtime::TsavRT::awaitCommandListCompletion (cumulative over all threads)
624 518.6700 0.8312 0.0000 [5.46e-01%] [Thread] tsi::runtime::TsavRT::awaitCommandListCompletion
624 91913.1910 147.2968 91913.1910 └─ [96.69%] TXE 0 Idle
88 79.3888 0.9021 79.3888 └─ [8.35e-02%] [ txe_swiglu ]
180 68.6057 0.3811 68.6057 └─ [7.22e-02%] [ txe_rms_norm ]
180 54.7632 0.3042 54.7632 └─ [5.76e-02%] [ txe_mult ]
176 50.1866 0.2852 50.1866 └─ [5.28e-02%] [ txe_add ]
[Thread] OPU (cumulative over all threads)
[Thread] tsi::runtime::TsavRT::finalizeCommandList (cumulative over all threads)
624 426.2960 0.6832 406.4210 [4.48e-01%] [Thread] tsi::runtime::TsavRT::finalizeCommandList
624 19.8750 0.0319 19.8750 └─ [2.09e-02%] tsi::runtime::executeWithTimeout
[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)
624 694.6400 1.1132 25.9990 [7.31e-01%] [Thread] tsi::runtime::TsavRT::processResponses
624 668.6410 1.0715 668.6410 └─ [7.03e-01%] tsi::runtime::executeWithTimeout
[Thread] tsi::runtime::TsavRTFPGA::finalize (cumulative over all threads)
[Thread] tsi::runtime::TsavRT::allocate (cumulative over all threads)
626 56.3350 0.0900 56.3350 [5.93e-02%] [Thread] tsi::runtime::TsavRT::allocate
[Thread] tsi::runtime::TsavRTFPGA::loadBlob (cumulative over all threads)
624 249.0610 0.3991 249.0610 [2.62e-01%] [Thread] tsi::runtime::TsavRTFPGA::loadBlob
[Thread] tsi::runtime::TsavRT::addCommandToList (cumulative over all threads)
624 47.7800 0.0766 47.7800 [5.03e-02%] [Thread] tsi::runtime::TsavRT::addCommandToList
[Thread] tsi::runtime::TsavRTFPGA::unloadBlob (cumulative over all threads)
624 47.7840 0.0766 47.7840 [5.03e-02%] [Thread] tsi::runtime::TsavRTFPGA::unloadBlob
[Thread] tsi::runtime::TsavRT::deallocate (cumulative over all threads)
624 8.1720 0.0131 8.1720 [8.60e-03%] [Thread] tsi::runtime::TsavRT::deallocate
========================================================================================================================
Counter Metrics:
Metric Min Max Avg
Queue_0_Occupancy 0.0000 1.0000 0.7891
root@agilex7_dk_si_agf014eb:/usr/bin/tsi/v0.1.1.tsv38_10_12_2025/bin#
POSIX results
[atrivedi@ws01 llama.cpp]$ build-posix/bin/llama-cli -p "My cat’s name" -m /proj/rel/sw/ggml/models/Tiny-Llama-v0.3-FP32-1.1B-F32.gguf --device none -c 12288 --temp 0.0 --n-predict 10 --repeat-penalty 1.5 -b 1024 --top-k 50 --top-p 0.9 --repeat-last-n 5 --no-warmup --no-display-prompt --single-turn
is “Sparky” and I like to
llama_perf_sampler_print: sampling time = 23.38 ms / 16 runs ( 1.46 ms per token, 684.29 tokens per second)
llama_perf_context_print: load time = 11765.44 ms
llama_perf_context_print: prompt eval time = 1875.26 ms / 6 tokens ( 312.54 ms per token, 3.20 tokens per second)
llama_perf_context_print: eval time = 2299.64 ms / 9 runs ( 255.52 ms per token, 3.91 tokens per second)
llama_perf_context_print: total time = 14091.36 ms / 15 tokens
=== GGML Perf Summary ===
Op Target Runs TSI_KERNEL-RUN Total us Avg us
ADD CPU 7363 0 50464 6.85
MUL CPU 6618 0 60838 9.19
RMS_NORM CPU 6477 0 24035 3.71
MUL_MAT CPU 28386 0 49873174 1756.96
CPY CPU 177 0 24627 139.14
RESHAPE CPU 15525 0 6318 0.41
VIEW CPU 16600 0 2046 0.12
PERMUTE CPU 10211 0 1081 0.11
GET_ROWS CPU 413 0 15115 36.60
SET_ROWS CPU 6717 0 9140 1.36
ROPE CPU 7876 0 33477 4.25
FLASH_ATTN_EXT CPU 4043 0 265055 65.56
GLU CPU 3693 0 101761 27.56
[2025-10-29 08:35:16.416458] 3245264:3245264 [warning] TsavRT-0.4.5 TsavRT.h:165: TsavRT destructor reached without finalize()
pure virtual method called
No symbol table is loaded. Use the "file" command.
[New LWP 3245266]
[New LWP 3245267]
[New LWP 3245273]
[New LWP 3245274]
[New LWP 3245275]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
warning: File "/proj/local/gcc-13.3.0/lib64/libstdc++.so.6.0.32-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
To enable execution of this file add
add-auto-load-safe-path /proj/local/gcc-13.3.0/lib64/libstdc++.so.6.0.32-gdb.py
line to your configuration file "/users/atrivedi/.gdbinit".
To completely disable this security protection add
set auto-load safe-path /
line to your configuration file "/users/atrivedi/.gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual. E.g., run from the shell:
info "(gdb)Auto-loading safe path"
0x00007f7ac2db4312 in waitpid () from /lib64/libpthread.so.0
No symbol "frame" in current context.
[Inferior 1 (process 3245264) detached]
terminate called without an active exception
Aborted (core dumped)
[atrivedi@ws01 llama.cpp]$
[atrivedi@ws01 llama.cpp]$ build-posix/bin/llama-cli -p "My cat’s name" -m /proj/rel/sw/ggml/models/Tiny-Llama-v0.3-FP32-1.1B-F32.gguf --device tSavorite -c 12288 --temp 0.0 --n-predict 10 --repeat-penalty 1.5 -b 1024 --top-k 50 --top-p 0.9 --repeat-last-n 5 --no-warmup --no-display-prompt --single-turn
is “Sparky” and I like
llama_perf_sampler_print: sampling time = 26.88 ms / 16 runs ( 1.68 ms per token, 595.22 tokens per second)
llama_perf_context_print: load time = 6260.56 ms
llama_perf_context_print: prompt eval time = 4963.25 ms / 6 tokens ( 827.21 ms per token, 1.21 tokens per second)
llama_perf_context_print: eval time = 7965.88 ms / 9 runs ( 885.10 ms per token, 1.13 tokens per second)
llama_perf_context_print: total time = 14256.38 ms / 15 tokens
=== GGML Perf Summary ===
Op Target Runs TSI_KERNEL-RUN Total us Avg us
ADD OPU 2024 2234 3191330 1576.74
MUL OPU 2070 2285 1301083 628.54
RMS_NORM OPU 2070 2070 1572910 759.86
MUL_MAT CPU 36468 0 53550692 1468.43
CONT CPU 7567 0 425810 56.27
RESHAPE CPU 11611 0 6708 0.58
VIEW CPU 17644 0 2409 0.14
PERMUTE CPU 13935 0 3508 0.25
TRANSPOSE CPU 3341 0 827 0.25
GET_ROWS CPU 358 0 3702 10.34
SET_ROWS CPU 7326 0 7122 0.97
SOFT_MAX OPU 1012 35904 21545064 21289.59
ROPE CPU 7777 0 42279 5.44
GLU OPU 1012 1117 1566043 1547.47
to
OPU Profiling Results:
Calls Total(ms) T/call Self(ms) Function
[Thread] tsi::runtime::TsavRT::finalize (cumulative over all threads)
[Thread] tsi::runtime::TsavRTPosix::loadBlob (cumulative over all threads)
12650 2640.6690 0.2087 384.3910 [16.14%] [Thread] tsi::runtime::TsavRTPosix::loadBlob
25300 2252.0890 0.0890 2252.0890 └─ [13.76%] tsi::runtime::executeWithTimeout
12650 4.1890 3.31e-04 4.1890 └─ [2.56e-02%] LOAD_BLOB Command Execution
12650 0.0000 0.0000 0.0000 └─ [0.00e+00%] Command{command=2 (LOAD_BLOB), blob_args=[2148009600[0x800...
12650 0.0000 0.0000 0.0000 └─ [0.00e+00%] TXE 0 Idle
[Thread] tsi::runtime::TsavRTPosix::unloadBlob (cumulative over all threads)
12650 2232.9710 0.1765 439.3630 [13.65%] [Thread] tsi::runtime::TsavRTPosix::unloadBlob
25300 1788.2030 0.0707 1788.2030 └─ [10.93%] tsi::runtime::executeWithTimeout
12650 5.4050 4.27e-04 5.4050 └─ [3.30e-02%] UNLOAD_BLOB Command Execution
12650 0.0000 0.0000 0.0000 └─ [0.00e+00%] Command{command=3 (UNLOAD_BLOB), blob_args=[2148009600[0x8...
12650 0.0000 0.0000 0.0000 └─ [0.00e+00%] TXE 0 Idle
[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)
12652 2385.7820 0.1886 106.2750 [14.58%] [Thread] tsi::runtime::TsavRT::processResponses
12652 2279.5070 0.1802 2279.5070 └─ [13.93%] tsi::runtime::executeWithTimeout
[Thread] OPU (cumulative over all threads)
[Thread] tsi::runtime::TsavRT::finalizeCommandList (cumulative over all threads)
12650 203.3630 0.0161 184.7700 [ 1.24%] [Thread] tsi::runtime::TsavRT::finalizeCommandList
12650 18.5930 0.0015 18.5930 └─ [1.14e-01%] tsi::runtime::executeWithTimeout
[Thread] tsi::runtime::TsavRT::allocate (cumulative over all threads)
12653 19.6360 0.0016 19.6360 [1.20e-01%] [Thread] tsi::runtime::TsavRT::allocate
[Thread] tsi::runtime::TsavRT::addCommandToList (cumulative over all threads)
12650 50.9880 0.0040 50.9880 [3.12e-01%] [Thread] tsi::runtime::TsavRT::addCommandToList
[Thread] tsi::runtime::TsavRT::awaitCommandListCompletion (cumulative over all threads)
12650 2804.0340 0.2217 2804.0340 [17.14%] [Thread] tsi::runtime::TsavRT::awaitCommandListCompletion
[Thread] tsi::runtime::TsavRT::deallocate (cumulative over all threads)
12650 16.5870 0.0013 16.5870 [1.01e-01%] [Thread] tsi::runtime::TsavRT::deallocate
========================================================================================================================
Counter Metrics:
Metric Min Max Avg
Queue_0_Occupancy 0.0000 1.0000 0.9998
[atrivedi@ws01 llama.cpp]$