
feat: add proper batching to perplexity #19661

Merged
ggerganov merged 1 commit into ggml-org:master from AesSedai:perplexity-batching
Feb 16, 2026

Conversation


@AesSedai (Contributor) commented Feb 16, 2026

This PR updates llama-perplexity to support batching in the same way llama-imatrix does: increasing --batch-size / --ubatch-size processes multiple context chunks per batch. This has limited benefit in VRAM-rich environments (e.g., when the entire model fits in VRAM), but it makes a huge difference when running models in a mixed CPU/GPU setup, since it saves n_seq round trips from CPU RAM to GPU VRAM per batch.
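The batching scheme is simple to sketch. Below is a minimal, self-contained illustration of how context chunks can be grouped into decode calls of up to n_seq sequences each — an assumption about the loop structure in the spirit of llama-imatrix, not the PR's actual code:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical sketch, not llama.cpp code: group n_chunk context chunks
// into batches of up to n_seq sequences. Each chunk in a batch would be
// submitted under its own sequence id so the KV cache keeps the contexts
// separate.
struct batch_plan {
    int first_chunk; // index of the first chunk in this batch
    int n_seq_used;  // sequences actually filled (last batch may be partial)
};

static std::vector<batch_plan> plan_batches(int n_chunk, int n_seq) {
    std::vector<batch_plan> plans;
    for (int i = 0; i < n_chunk; i += n_seq) {
        plans.push_back({ i, std::min(n_seq, n_chunk - i) });
    }
    return plans;
}
```

With 129 chunks and n_seq = 8 (the numbers in the logs below), this yields 17 decode calls instead of 129, which is where the saved CPU-to-GPU trips come from.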

I've double-checked the before-and-after output to make sure the resulting PPL and KLD still look correct.

gemma-3-4b-it before:
./build/bin/llama-perplexity --threads 48 --flash-attn on --file /mnt/srv/host/resources/KLD/calibration_datav3.txt --kl-divergence-base /mnt/srv/snowdrift/ref-logits-gemma-3-4b-it-BF16-calibration-datav3.bin --kl-divergence --model /mnt/srv/snowdrift/gguf/gemma-3-4b-it-GGUF/aes_sedai/gemma-3-4b-it-IQ2_S.gguf
...
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 512
llama_context: n_ctx_seq     = 512
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = false
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 0.125
llama_context: n_ctx_seq (512) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     1.00 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 512 cells
llama_kv_cache:      CUDA0 KV buffer size =     4.00 MiB
llama_kv_cache:      CUDA1 KV buffer size =     6.00 MiB
llama_kv_cache: size =   10.00 MiB (   512 cells,   5 layers,  1/1 seqs), K (f16):    5.00 MiB, V (f16):    5.00 MiB
llama_kv_cache_iswa: creating     SWA KV cache, size = 512 cells
llama_kv_cache:      CUDA0 KV buffer size =    20.00 MiB
llama_kv_cache:      CUDA1 KV buffer size =    38.00 MiB
llama_kv_cache: size =   58.00 MiB (   512 cells,  29 layers,  1/1 seqs), K (f16):   29.00 MiB, V (f16):   29.00 MiB
llama_context: pipeline parallelism enabled
sched_reserve: reserving ...
sched_reserve:      CUDA0 compute buffer size =   103.07 MiB
sched_reserve:      CUDA1 compute buffer size =   541.20 MiB
sched_reserve:  CUDA_Host compute buffer size =    18.09 MiB
sched_reserve: graph nodes  = 1369
sched_reserve: graph splits = 3
sched_reserve: reserve took 5.66 ms, sched copies = 4
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 48 (n_threads_batch = 48) / 56 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
kl_divergence: 0.24 seconds per pass - ETA 0.52 minutes

chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   1      15.0357 ±    3.3711       0.15313 ±    0.11885       0.80485 ±    0.09019    27.752 ±  2.247 %    64.706 ±  2.999 %

chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   2      11.5069 ±    1.7149       0.23099 ±    0.09393       0.89338 ±    0.07533    29.758 ±  1.634 %    66.863 ±  2.086 %

chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   3      10.0124 ±    1.1843       0.16441 ±    0.07179       0.79792 ±    0.05573    28.398 ±  1.288 %    69.673 ±  1.663 %

chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   4      12.4115 ±    1.2811       0.13942 ±    0.06238       0.81656 ±    0.04655    26.965 ±  1.088 %    67.353 ±  1.469 %

chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   5      13.5092 ±    1.2894       0.17750 ±    0.05767       0.83100 ±    0.04336    27.051 ±  0.977 %    68.314 ±  1.303 %

chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   6      11.9776 ±    1.0511       0.24994 ±    0.05454       0.86421 ±    0.04179    28.124 ±  0.926 %    69.085 ±  1.182 %

chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   7      13.0925 ±    1.0678       0.24534 ±    0.04886       0.83698 ±    0.03680    27.572 ±  0.842 %    68.627 ±  1.099 %
...
chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
 122      15.1762 ±    0.2888       0.16397 ±    0.00955       0.60769 ±    0.00698    21.281 ±  0.185 %    70.601 ±  0.258 %

chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
 123      15.0941 ±    0.2858       0.16347 ±    0.00951       0.60676 ±    0.00694    21.289 ±  0.185 %    70.633 ±  0.257 %

chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
 124      15.2232 ±    0.2872       0.16210 ±    0.00947       0.60811 ±    0.00691    21.289 ±  0.184 %    70.585 ±  0.256 %

chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
 125      15.3643 ±    0.2892       0.16266 ±    0.00945       0.60991 ±    0.00690    21.262 ±  0.183 %    70.510 ±  0.255 %

chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
 126      15.4685 ±    0.2900       0.16302 ±    0.00940       0.61048 ±    0.00686    21.256 ±  0.182 %    70.401 ±  0.255 %

chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
 127      15.6189 ±    0.2919       0.16244 ±    0.00938       0.61218 ±    0.00683    21.253 ±  0.181 %    70.307 ±  0.254 %

chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
 128      15.7610 ±    0.2937       0.16414 ±    0.00934       0.61333 ±    0.00679    21.225 ±  0.180 %    70.233 ±  0.253 %

chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
 129      15.9194 ±    0.2959       0.16444 ±    0.00932       0.61541 ±    0.00678    21.223 ±  0.179 %    70.144 ±  0.252 %

====== Perplexity statistics ======
Mean PPL(Q)                   :  15.919427 ±   0.295934
Mean PPL(base)                :  13.505476 ±   0.255762
Cor(ln(PPL(Q)), ln(PPL(base))):  87.69%
Mean ln(PPL(Q)/PPL(base))     :   0.164445 ±   0.009317
Mean PPL(Q)/PPL(base)         :   1.178739 ±   0.010983
Mean PPL(Q)-PPL(base)         :   2.413951 ±   0.142312

====== KL divergence statistics ======
Mean    KLD:   0.615415 ±   0.006776
Maximum KLD:  32.369621
99.9%   KLD:  13.583662
99.0%   KLD:   5.987407
95.0%   KLD:   2.377642
90.0%   KLD:   1.476913
Median  KLD:   0.251914
10.0%   KLD:   0.001650
 5.0%   KLD:   0.000196
 1.0%   KLD:   0.000004
 0.1%   KLD:   0.000000
Minimum KLD:  -0.000003

====== Token probability statistics ======
Mean    Δp: -5.715 ± 0.113 %
Maximum Δp: 99.972%
99.9%   Δp: 91.203%
99.0%   Δp: 46.782%
95.0%   Δp: 14.613%
90.0%   Δp:  5.583%
75.0%   Δp:  0.194%
Median  Δp: -0.186%
25.0%   Δp: -7.205%
10.0%   Δp: -27.139%
 5.0%   Δp: -46.367%
 1.0%   Δp: -89.934%
 0.1%   Δp: -99.930%
Minimum Δp: -100.000%
RMS Δp    : 21.223 ± 0.179 %
Same top p: 70.144 ± 0.252 %

llama_perf_context_print:        load time =     830.11 ms
llama_perf_context_print: prompt eval time =   12023.81 ms / 66048 tokens (    0.18 ms per token,  5493.10 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =   22498.18 ms / 66049 tokens
llama_perf_context_print:    graphs reused =        128
llama_memory_breakdown_print: | memory breakdown [MiB] | total   free    self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24135 = 3524 + ( 466 =   339 +      24 +     103) +       20144 |
llama_memory_breakdown_print: |   - CUDA1 (RTX 3090)   | 24135 = 7060 + (1621 =  1036 +      44 +     541) +       15452 |
llama_memory_breakdown_print: |   - Host               |                  458 =   440 +       0 +      18                |

gemma-3-4b-it after:
./build/bin/llama-perplexity --threads 48 --flash-attn on --file /mnt/srv/host/resources/KLD/calibration_datav3.txt --kl-divergence-base /mnt/srv/snowdrift/ref-logits-gemma-3-4b-it-BF16-calibration-datav3.bin --kl-divergence --model /mnt/srv/snowdrift/gguf/gemma-3-4b-it-GGUF/aes_sedai/gemma-3-4b-it-IQ2_S.gguf --batch-size 4096 --ubatch-size 4096
...
llama_context: constructing llama_context
llama_context: n_seq_max     = 8
llama_context: n_ctx         = 4096
llama_context: n_ctx_seq     = 512
llama_context: n_batch       = 4096
llama_context: n_ubatch      = 4096
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = false
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 0.125
llama_context: n_ctx_seq (512) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     8.00 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 512 cells
llama_kv_cache:      CUDA0 KV buffer size =    48.00 MiB
llama_kv_cache:      CUDA1 KV buffer size =    32.00 MiB
llama_kv_cache: size =   80.00 MiB (   512 cells,   5 layers,  8/8 seqs), K (f16):   40.00 MiB, V (f16):   40.00 MiB
llama_kv_cache_iswa: creating     SWA KV cache, size = 512 cells
llama_kv_cache:      CUDA0 KV buffer size =   240.00 MiB
llama_kv_cache:      CUDA1 KV buffer size =   224.00 MiB
llama_kv_cache: size =  464.00 MiB (   512 cells,  29 layers,  8/8 seqs), K (f16):  232.00 MiB, V (f16):  232.00 MiB
llama_context: pipeline parallelism enabled
sched_reserve: reserving ...
sched_reserve:      CUDA0 compute buffer size =   824.56 MiB
sched_reserve:      CUDA1 compute buffer size =  4329.62 MiB
sched_reserve:  CUDA_Host compute buffer size =   144.69 MiB
sched_reserve: graph nodes  = 1437
sched_reserve: graph splits = 3
sched_reserve: reserve took 30.67 ms, sched copies = 4
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 48 (n_threads_batch = 48) / 56 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
kl_divergence: computing over 129 chunks, n_ctx=512, batch_size=4096, n_seq=8
kl_divergence: 0.97 seconds per pass - ETA 0.25 minutes

chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   1      14.9041 ±    3.3408       0.14434 ±    0.11943       0.79810 ±    0.08939    27.540 ±  2.222 %    63.922 ±  3.013 %
   2      11.4567 ±    1.7040       0.22661 ±    0.09373       0.89141 ±    0.07495    29.718 ±  1.625 %    66.471 ±  2.093 %
   3       9.9591 ±    1.1768       0.15908 ±    0.07163       0.79424 ±    0.05538    28.277 ±  1.283 %    69.804 ±  1.661 %
   4      12.3268 ±    1.2704       0.13257 ±    0.06233       0.81133 ±    0.04621    26.862 ±  1.084 %    67.353 ±  1.469 %
   5      13.4636 ±    1.2848       0.17412 ±    0.05765       0.82728 ±    0.04318    26.983 ±  0.975 %    68.000 ±  1.307 %
   6      11.9417 ±    1.0488       0.24694 ±    0.05469       0.86203 ±    0.04178    28.100 ±  0.924 %    68.889 ±  1.184 %
   7      13.0485 ±    1.0644       0.24197 ±    0.04901       0.83525 ±    0.03680    27.559 ±  0.841 %    68.459 ±  1.100 %
...
 122      15.1723 ±    0.2886       0.16371 ±    0.00955       0.60753 ±    0.00696    21.275 ±  0.186 %    70.476 ±  0.259 %
 123      15.0894 ±    0.2856       0.16317 ±    0.00951       0.60659 ±    0.00692    21.281 ±  0.185 %    70.515 ±  0.257 %
 124      15.2173 ±    0.2870       0.16171 ±    0.00947       0.60783 ±    0.00690    21.278 ±  0.184 %    70.468 ±  0.257 %
 125      15.3579 ±    0.2890       0.16225 ±    0.00944       0.60959 ±    0.00688    21.251 ±  0.183 %    70.394 ±  0.256 %
 126      15.4621 ±    0.2898       0.16262 ±    0.00940       0.61015 ±    0.00684    21.243 ±  0.182 %    70.289 ±  0.255 %
 127      15.6122 ±    0.2917       0.16202 ±    0.00937       0.61183 ±    0.00681    21.242 ±  0.181 %    70.196 ±  0.254 %
 128      15.7527 ±    0.2934       0.16362 ±    0.00933       0.61289 ±    0.00677    21.213 ±  0.180 %    70.113 ±  0.253 %
 129      15.9111 ±    0.2957       0.16392 ±    0.00931       0.61497 ±    0.00675    21.211 ±  0.180 %    70.026 ±  0.253 %

====== Perplexity statistics ======
Mean PPL(Q)                   :  15.911081 ±   0.295677
Mean PPL(base)                :  13.505476 ±   0.255762
Cor(ln(PPL(Q)), ln(PPL(base))):  87.69%
Mean ln(PPL(Q)/PPL(base))     :   0.163921 ±   0.009314
Mean PPL(Q)/PPL(base)         :   1.178121 ±   0.010973
Mean PPL(Q)-PPL(base)         :   2.405605 ±   0.142151

====== KL divergence statistics ======
Mean    KLD:   0.614975 ±   0.006752
Maximum KLD:  31.323584
99.9%   KLD:  13.271286
99.0%   KLD:   5.966651
95.0%   KLD:   2.371548
90.0%   KLD:   1.477644
Median  KLD:   0.251741
10.0%   KLD:   0.001684
 5.0%   KLD:   0.000194
 1.0%   KLD:   0.000004
 0.1%   KLD:   0.000000
Minimum KLD:  -0.000003

====== Token probability statistics ======
Mean    Δp: -5.711 ± 0.113 %
Maximum Δp: 99.977%
99.9%   Δp: 91.460%
99.0%   Δp: 46.731%
95.0%   Δp: 14.554%
90.0%   Δp:  5.534%
75.0%   Δp:  0.191%
Median  Δp: -0.185%
25.0%   Δp: -7.210%
10.0%   Δp: -27.139%
 5.0%   Δp: -45.980%
 1.0%   Δp: -89.789%
 0.1%   Δp: -99.906%
Minimum Δp: -100.000%
RMS Δp    : 21.211 ± 0.180 %
Same top p: 70.026 ± 0.253 %

llama_perf_context_print:        load time =     912.78 ms
llama_perf_context_print: prompt eval time =    9235.28 ms / 66048 tokens (    0.14 ms per token,  7151.71 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =   20546.10 ms / 66049 tokens
llama_perf_context_print:    graphs reused =          0
llama_memory_breakdown_print: | memory breakdown [MiB] | total    free    self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - CUDA0 (RTX 3090)   | 24135 = 22167 + (1614 =   502 +     288 +     824) +         352 |
llama_memory_breakdown_print: |   - CUDA1 (RTX 3090)   | 24135 = 18325 + (5459 =   873 +     256 +    4329) +         349 |
llama_memory_breakdown_print: |   - Host               |                   584 =   440 +       0 +     144                |
| Model | State | n_seq | Mean PPL | Mean KLD | Prompt Eval Time | Tk/s | Total Time |
|---|---|---|---|---|---|---|---|
| Gemma3 4B | Before | 1 | 15.919427 ± 0.295934 | 0.615415 ± 0.006776 | 12023.81 ms | 5493.10 | 22498.18 ms |
| Gemma3 4B | After | 8 | 15.911081 ± 0.295677 | 0.614975 ± 0.006752 | 9235.28 ms | 7151.71 | 20546.10 ms |
| MiniMax-M2.5 | Before | 1 | 8.266626 ± 0.135207 | 0.243048 ± 0.004122 | 261952.68 ms | 236.50 | 272760.91 ms |
| MiniMax-M2.5 | After | 8 | 8.286406 ± 0.135739 | 0.245355 ± 0.004200 | 63141.19 ms | 981.17 | 75111.33 ms |

There are a couple of other small changes: the total chunk count is now printed early in the output, like llama-imatrix does, and the per-chunk header is no longer reprinted every cycle, to clean up the CLI output a bit.

I recommend setting both --batch-size and --ubatch-size when testing; otherwise you end up with performance similar to the n_seq=1 case.
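For reference, the n_seq_max = 8 in the log above follows directly from the flags: each perplexity chunk occupies n_ctx = 512 tokens, so a 4096-token batch holds 4096 / 512 = 8 sequences. A hedged sketch of that arithmetic (my own helper name, not a llama.cpp function):

```cpp
#include <cassert>

// Sketch only: the number of fixed-size context chunks that fit in one
// decode call is the batch size divided by the chunk length (floored),
// clamped to at least 1.
static int n_seq_from_batch(int n_batch, int n_ctx) {
    const int n = n_batch / n_ctx;
    return n > 0 ? n : 1;
}
```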

@ggerganov ggerganov merged commit d612901 into ggml-org:master Feb 16, 2026
78 checks passed
michaelneale added a commit to michaelneale/llama.cpp that referenced this pull request Feb 17, 2026
liparetejas pushed a commit to liparetejas/llama.cpp that referenced this pull request Feb 23, 2026