
feat: add proper batching to perplexity #19661

Merged
ggerganov merged 1 commit into ggml-org:master from AesSedai:perplexity-batching
Feb 16, 2026

Conversation


@AesSedai (Contributor) commented Feb 16, 2026

This PR updates llama-perplexity to support batching in the same way llama-imatrix does: increasing --batch-size / --ubatch-size processes multiple context chunks per batch. This has limited benefit in VRAM-rich environments (e.g., when the entire model fits in VRAM), but it makes a huge difference when running models in a mixed CPU/GPU setup, since it saves n_seq round trips from CPU RAM to GPU VRAM per batch.
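The batching scheme is simple to sketch. Below is a minimal, self-contained illustration of how context chunks can be grouped into decode calls of up to n_seq sequences each — an assumption about the loop structure in the spirit of llama-imatrix, not the PR's actual code:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical sketch, not llama.cpp code: group n_chunk context chunks
// into batches of up to n_seq sequences. Each chunk in a batch would be
// submitted under its own sequence id so the KV cache keeps the contexts
// separate.
struct batch_plan {
    int first_chunk; // index of the first chunk in this batch
    int n_seq_used;  // sequences actually filled (last batch may be partial)
};

static std::vector<batch_plan> plan_batches(int n_chunk, int n_seq) {
    std::vector<batch_plan> plans;
    for (int i = 0; i < n_chunk; i += n_seq) {
        plans.push_back({ i, std::min(n_seq, n_chunk - i) });
    }
    return plans;
}
```

With 129 chunks and n_seq = 8 (the numbers in the logs below), this yields 17 decode calls instead of 129, which is where the saved CPU-to-GPU trips come from.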

I've double-checked the before-and-after output to make sure the resulting PPL and KLD still look correct.

gemma-3-4b-it before:
./build/bin/llama-perplexity --threads 48 --flash-attn on --file /mnt/srv/host/resources/KLD/calibration_datav3.txt --kl-divergence-base /mnt/srv/snowdrift/ref-logits-gemma-3-4b-it-BF16-calibration-datav3.bin --kl-divergence --model /mnt/srv/snowdrift/gguf/gemma-3-4b-it-GGUF/aes_sedai/gemma-3-4b-it-IQ2_S.gguf
...
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 512
llama_context: n_ctx_seq     = 512
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = false
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 0.125
llama_context: n_ctx_seq (512) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     1.00 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 512 cells
llama_kv_cache:      CUDA0 KV buffer size =     4.00 MiB
llama_kv_cache:      CUDA1 KV buffer size =     6.00 MiB
llama_kv_cache: size =   10.00 MiB (   512 cells,   5 layers,  1/1 seqs), K (f16):    5.00 MiB, V (f16):    5.00 MiB
llama_kv_cache_iswa: creating     SWA KV cache, size = 512 cells
llama_kv_cache:      CUDA0 KV buffer size =    20.00 MiB
llama_kv_cache:      CUDA1 KV buffer size =    38.00 MiB
llama_kv_cache: size =   58.00 MiB (   512 cells,  29 layers,  1/1 seqs), K (f16):   29.00 MiB, V (f16):   29.00 MiB
llama_context: pipeline parallelism enabled
sched_reserve: reserving ...
sched_reserve:      CUDA0 compute buffer size =   103.07 MiB
sched_reserve:      CUDA1 compute buffer size =   541.20 MiB
sched_reserve:  CUDA_Host compute buffer size =    18.09 MiB
sched_reserve: graph nodes  = 1369
sched_reserve: graph splits = 3
sched_reserve: reserve took 5.66 ms, sched copies = 4
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 48 (n_threads_batch = 48) / 56 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
kl_divergence: 0.24 seconds per pass - ETA 0.52 minutes

chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   1      15.0357 ±    3.3711       0.15313 ±    0.11885       0.80485 ±    0.09019    27.752 ±  2.247 %    64.706 ±  2.999 %

chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   2      11.5069 ±    1.7149       0.23099 ±    0.09393       0.89338 ±    0.07533    29.758 ±  1.634 %    66.863 ±  2.086 %

chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   3      10.0124 ±    1.1843       0.16441 ±    0.07179       0.79792 ±    0.05573    28.398 ±  1.288 %    69.673 ±  1.663 %

chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   4      12.4115 ±    1.2811       0.13942 ±    0.06238       0.81656 ±    0.04655    26.965 ±  1.088 %    67.353 ±  1.469 %

chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   5      13.5092 ±    1.2894       0.17750 ±    0.05767       0.83100 ±    0.04336    27.051 ±  0.977 %    68.314 ±  1.303 %

chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   6      11.9776 ±    1.0511       0.24994 ±    0.05454       0.86421 ±    0.04179    28.124 ±  0.926 %    69.085 ±  1.182 %

chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   7      13.0925 ±    1.0678       0.24534 ±    0.04886       0.83698 ±    0.03680    27.572 ±  0.842 %    68.627 ±  1.099 %
...
chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
 122      15.1762 ±    0.2888       0.16397 ±    0.00955       0.60769 ±    0.00698    21.281 ±  0.185 %    70.601 ±  0.258 %

chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
 123      15.0941 ±    0.2858       0.16347 ±    0.00951       0.60676 ±    0.00694    21.289 ±  0.185 %    70.633 ±  0.257 %

chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
 124      15.2232 ±    0.2872       0.16210 ±    0.00947       0.60811 ±    0.00691    21.289 ±  0.184 %    70.585 ±  0.256 %

chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
 125      15.3643 ±    0.2892       0.16266 ±    0.00945       0.60991 ±    0.00690    21.262 ±  0.183 %    70.510 ±  0.255 %

chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
 126      15.4685 ±    0.2900       0.16302 ±    0.00940       0.61048 ±    0.00686    21.256 ±  0.182 %    70.401 ±  0.255 %

chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
 127      15.6189 ±    0.2919       0.16244 ±    0.00938       0.61218 ±    0.00683    21.253 ±  0.181 %    70.307 ±  0.254 %

chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
 128      15.7610 ±    0.2937       0.16414 ±    0.00934       0.61333 ±    0.00679    21.225 ±  0.180 %    70.233 ±  0.253 %

chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
 129      15.9194 ±    0.2959       0.16444 ±    0.00932       0.61541 ±    0.00678    21.223 ±  0.179 %    70.144 ±  0.252 %

====== Perplexity statistics ======
Mean PPL(Q)                   :  15.919427 ±   0.295934
Mean PPL(base)                :  13.505476 ±   0.255762
Cor(ln(PPL(Q)), ln(PPL(base))):  87.69%
Mean ln(PPL(Q)/PPL(base))     :   0.164445 ±   0.009317
Mean PPL(Q)/PPL(base)         :   1.178739 ±   0.010983
Mean PPL(Q)-PPL(base)         :   2.413951 ±   0.142312

====== KL divergence statistics ======
Mean    KLD:   0.615415 ±   0.006776
Maximum KLD:  32.369621
99.9%   KLD:  13.583662
99.0%   KLD:   5.987407
95.0%   KLD:   2.377642
90.0%   KLD:   1.476913
Median  KLD:   0.251914
10.0%   KLD:   0.001650
 5.0%   KLD:   0.000196
 1.0%   KLD:   0.000004
 0.1%   KLD:   0.000000
Minimum KLD:  -0.000003

====== Token probability statistics ======
Mean    Δp: -5.715 ± 0.113 %
Maximum Δp: 99.972%
99.9%   Δp: 91.203%
99.0%   Δp: 46.782%
95.0%   Δp: 14.613%
90.0%   Δp:  5.583%
75.0%   Δp:  0.194%
Median  Δp: -0.186%
25.0%   Δp: -7.205%
10.0%   Δp: -27.139%
 5.0%   Δp: -46.367%
 1.0%   Δp: -89.934%
 0.1%   Δp: -99.930%
Minimum Δp: -100.000%
RMS Δp    : 21.223 ± 0.179 %
Same top p: 70.144 ± 0.252 %

llama_perf_context_print:        load time =     830.11 ms
llama_perf_context_print: prompt eval time =   12023.81 ms / 66048 tokens (    0.18 ms per token,  5493.10 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =   22498.18 ms / 66049 tokens
llama_perf_context_print:    graphs reused =        128
llama_memory_breakdown_print: | memory breakdown [MiB] | total   free    self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24135 = 3524 + ( 466 =   339 +      24 +     103) +       20144 |
llama_memory_breakdown_print: |   - CUDA1 (RTX 3090)   | 24135 = 7060 + (1621 =  1036 +      44 +     541) +       15452 |
llama_memory_breakdown_print: |   - Host               |                  458 =   440 +       0 +      18                |

gemma-3-4b-it after:
./build/bin/llama-perplexity --threads 48 --flash-attn on --file /mnt/srv/host/resources/KLD/calibration_datav3.txt --kl-divergence-base /mnt/srv/snowdrift/ref-logits-gemma-3-4b-it-BF16-calibration-datav3.bin --kl-divergence --model /mnt/srv/snowdrift/gguf/gemma-3-4b-it-GGUF/aes_sedai/gemma-3-4b-it-IQ2_S.gguf --batch-size 4096 --ubatch-size 4096
...
llama_context: constructing llama_context
llama_context: n_seq_max     = 8
llama_context: n_ctx         = 4096
llama_context: n_ctx_seq     = 512
llama_context: n_batch       = 4096
llama_context: n_ubatch      = 4096
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = false
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 0.125
llama_context: n_ctx_seq (512) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     8.00 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 512 cells
llama_kv_cache:      CUDA0 KV buffer size =    48.00 MiB
llama_kv_cache:      CUDA1 KV buffer size =    32.00 MiB
llama_kv_cache: size =   80.00 MiB (   512 cells,   5 layers,  8/8 seqs), K (f16):   40.00 MiB, V (f16):   40.00 MiB
llama_kv_cache_iswa: creating     SWA KV cache, size = 512 cells
llama_kv_cache:      CUDA0 KV buffer size =   240.00 MiB
llama_kv_cache:      CUDA1 KV buffer size =   224.00 MiB
llama_kv_cache: size =  464.00 MiB (   512 cells,  29 layers,  8/8 seqs), K (f16):  232.00 MiB, V (f16):  232.00 MiB
llama_context: pipeline parallelism enabled
sched_reserve: reserving ...
sched_reserve:      CUDA0 compute buffer size =   824.56 MiB
sched_reserve:      CUDA1 compute buffer size =  4329.62 MiB
sched_reserve:  CUDA_Host compute buffer size =   144.69 MiB
sched_reserve: graph nodes  = 1437
sched_reserve: graph splits = 3
sched_reserve: reserve took 30.67 ms, sched copies = 4
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 48 (n_threads_batch = 48) / 56 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
kl_divergence: computing over 129 chunks, n_ctx=512, batch_size=4096, n_seq=8
kl_divergence: 0.97 seconds per pass - ETA 0.25 minutes

chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   1      14.9041 ±    3.3408       0.14434 ±    0.11943       0.79810 ±    0.08939    27.540 ±  2.222 %    63.922 ±  3.013 %
   2      11.4567 ±    1.7040       0.22661 ±    0.09373       0.89141 ±    0.07495    29.718 ±  1.625 %    66.471 ±  2.093 %
   3       9.9591 ±    1.1768       0.15908 ±    0.07163       0.79424 ±    0.05538    28.277 ±  1.283 %    69.804 ±  1.661 %
   4      12.3268 ±    1.2704       0.13257 ±    0.06233       0.81133 ±    0.04621    26.862 ±  1.084 %    67.353 ±  1.469 %
   5      13.4636 ±    1.2848       0.17412 ±    0.05765       0.82728 ±    0.04318    26.983 ±  0.975 %    68.000 ±  1.307 %
   6      11.9417 ±    1.0488       0.24694 ±    0.05469       0.86203 ±    0.04178    28.100 ±  0.924 %    68.889 ±  1.184 %
   7      13.0485 ±    1.0644       0.24197 ±    0.04901       0.83525 ±    0.03680    27.559 ±  0.841 %    68.459 ±  1.100 %
...
 122      15.1723 ±    0.2886       0.16371 ±    0.00955       0.60753 ±    0.00696    21.275 ±  0.186 %    70.476 ±  0.259 %
 123      15.0894 ±    0.2856       0.16317 ±    0.00951       0.60659 ±    0.00692    21.281 ±  0.185 %    70.515 ±  0.257 %
 124      15.2173 ±    0.2870       0.16171 ±    0.00947       0.60783 ±    0.00690    21.278 ±  0.184 %    70.468 ±  0.257 %
 125      15.3579 ±    0.2890       0.16225 ±    0.00944       0.60959 ±    0.00688    21.251 ±  0.183 %    70.394 ±  0.256 %
 126      15.4621 ±    0.2898       0.16262 ±    0.00940       0.61015 ±    0.00684    21.243 ±  0.182 %    70.289 ±  0.255 %
 127      15.6122 ±    0.2917       0.16202 ±    0.00937       0.61183 ±    0.00681    21.242 ±  0.181 %    70.196 ±  0.254 %
 128      15.7527 ±    0.2934       0.16362 ±    0.00933       0.61289 ±    0.00677    21.213 ±  0.180 %    70.113 ±  0.253 %
 129      15.9111 ±    0.2957       0.16392 ±    0.00931       0.61497 ±    0.00675    21.211 ±  0.180 %    70.026 ±  0.253 %

====== Perplexity statistics ======
Mean PPL(Q)                   :  15.911081 ±   0.295677
Mean PPL(base)                :  13.505476 ±   0.255762
Cor(ln(PPL(Q)), ln(PPL(base))):  87.69%
Mean ln(PPL(Q)/PPL(base))     :   0.163921 ±   0.009314
Mean PPL(Q)/PPL(base)         :   1.178121 ±   0.010973
Mean PPL(Q)-PPL(base)         :   2.405605 ±   0.142151

====== KL divergence statistics ======
Mean    KLD:   0.614975 ±   0.006752
Maximum KLD:  31.323584
99.9%   KLD:  13.271286
99.0%   KLD:   5.966651
95.0%   KLD:   2.371548
90.0%   KLD:   1.477644
Median  KLD:   0.251741
10.0%   KLD:   0.001684
 5.0%   KLD:   0.000194
 1.0%   KLD:   0.000004
 0.1%   KLD:   0.000000
Minimum KLD:  -0.000003

====== Token probability statistics ======
Mean    Δp: -5.711 ± 0.113 %
Maximum Δp: 99.977%
99.9%   Δp: 91.460%
99.0%   Δp: 46.731%
95.0%   Δp: 14.554%
90.0%   Δp:  5.534%
75.0%   Δp:  0.191%
Median  Δp: -0.185%
25.0%   Δp: -7.210%
10.0%   Δp: -27.139%
 5.0%   Δp: -45.980%
 1.0%   Δp: -89.789%
 0.1%   Δp: -99.906%
Minimum Δp: -100.000%
RMS Δp    : 21.211 ± 0.180 %
Same top p: 70.026 ± 0.253 %

llama_perf_context_print:        load time =     912.78 ms
llama_perf_context_print: prompt eval time =    9235.28 ms / 66048 tokens (    0.14 ms per token,  7151.71 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =   20546.10 ms / 66049 tokens
llama_perf_context_print:    graphs reused =          0
llama_memory_breakdown_print: | memory breakdown [MiB] | total    free    self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24135 = 22167 + (1614 =   502 +     288 +     824) +         352 |
llama_memory_breakdown_print: |   - CUDA1 (RTX 3090)   | 24135 = 18325 + (5459 =   873 +     256 +    4329) +         349 |
llama_memory_breakdown_print: |   - Host               |                   584 =   440 +       0 +     144                |
| Model | State | n_seq | Mean PPL | Mean KLD | Prompt Eval Time | Tk/s | Total Time |
|---|---|---|---|---|---|---|---|
| Gemma3 4B | Before | 1 | 15.919427 ± 0.295934 | 0.615415 ± 0.006776 | 12023.81 ms | 5493.10 | 22498.18 ms |
| Gemma3 4B | After | 8 | 15.911081 ± 0.295677 | 0.614975 ± 0.006752 | 9235.28 ms | 7151.71 | 20546.10 ms |
| MiniMax-M2.5 | Before | 1 | 8.266626 ± 0.135207 | 0.243048 ± 0.004122 | 261952.68 ms | 236.50 | 272760.91 ms |
| MiniMax-M2.5 | After | 8 | 8.286406 ± 0.135739 | 0.245355 ± 0.004200 | 63141.19 ms | 981.17 | 75111.33 ms |

There are a couple of other small changes: the total chunk count is now printed early in the output, like llama-imatrix does, and the per-chunk header is no longer reprinted every cycle, to clean up the CLI output a bit.

I recommend setting both --batch-size and --ubatch-size when testing; otherwise you end up with performance similar to the n_seq=1 case.
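For reference, the n_seq_max = 8 in the log above follows directly from the flags: each perplexity chunk occupies n_ctx = 512 tokens, so a 4096-token batch holds 4096 / 512 = 8 sequences. A hedged sketch of that arithmetic (my own helper name, not a llama.cpp function):

```cpp
#include <cassert>

// Sketch only: the number of fixed-size context chunks that fit in one
// decode call is the batch size divided by the chunk length (floored),
// clamped to at least 1.
static int n_seq_from_batch(int n_batch, int n_ctx) {
    const int n = n_batch / n_ctx;
    return n > 0 ? n : 1;
}
```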

@ggerganov ggerganov merged commit d612901 into ggml-org:master Feb 16, 2026
78 checks passed
michaelneale added a commit to michaelneale/llama.cpp that referenced this pull request Feb 17, 2026
liparetejas pushed a commit to liparetejas/llama.cpp that referenced this pull request Feb 23, 2026