kv-cache : rework kv_cell #13706

Open · wants to merge 6 commits into master from gg/kv-cache-simplify-part2
Conversation

ggerganov (Member) commented May 22, 2025

cont #13194

The KV cell editing logic is now implemented via the new struct llama_kv_cells_unified in the new source file src/llama-kv-cells.h. The goal is to simplify the implementation in llama-kv-cache.cpp and make it easier to understand and extend in the future.

One of the primary simplifications is that llama_kv_cache_unified no longer tracks the number of used cells manually. This is now tracked automatically by llama_kv_cells_unified based on the edits that are applied, such as adding and removing sequences from cells. The same applies to the has_shift flag.

  • The cell information (i.e. pos, delta, seq) is now a structure of arrays for better cache locality
  • The sequences that belong to a cell are now tracked with a std::bitset instead of std::set
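A minimal sketch of what such a cell store might look like (hypothetical names and simplified logic, not the actual code in src/llama-kv-cells.h): the per-cell fields live in parallel arrays, sequence membership is a fixed-size bitset, and the used counter maintains itself inside the mutators.

```cpp
#include <bitset>
#include <cstdint>
#include <vector>

using llama_pos    = int32_t;
using llama_seq_id = int32_t;

// hypothetical sketch, not the actual llama.cpp implementation:
// cell data as a structure of arrays, sequence membership as a bitset,
// with the 'used' counter maintained automatically by the mutators
struct kv_cells_sketch {
    static constexpr size_t MAX_SEQ = 64; // assumed bound on parallel sequences

    std::vector<llama_pos>            pos;   // position per cell, -1 == empty
    std::vector<llama_pos>            shift; // accumulated position delta per cell
    std::vector<std::bitset<MAX_SEQ>> seq;   // which sequences occupy each cell

    uint32_t used = 0; // number of non-empty cells, never updated by callers

    void resize(uint32_t n) {
        pos.assign(n, -1);
        shift.assign(n, 0);
        seq.assign(n, {});
    }

    uint32_t size() const { return (uint32_t) pos.size(); }

    bool seq_has(uint32_t i, llama_seq_id s) const { return seq[i].test(s); }

    // place a position into a cell; an empty cell becomes used
    void pos_set(uint32_t i, llama_pos p) {
        if (pos[i] == -1) { used++; }
        pos[i] = p;
    }

    void seq_add(uint32_t i, llama_seq_id s) { seq[i].set(s); }

    // remove a sequence from a cell; if no sequence remains, the cell is
    // cleared and 'used' decreases - no manual bookkeeping at call sites
    void seq_rm(uint32_t i, llama_seq_id s) {
        seq[i].reset(s);
        if (seq[i].none() && pos[i] != -1) {
            pos[i]   = -1;
            shift[i] = 0;
            used--;
        }
    }
};
```

The bitset makes per-cell sequence queries a constant-time bit test rather than a std::set lookup, and the parallel arrays keep the hot pos data contiguous in memory.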

Here is an example of the position shift logic before and after the change:

// before
    for (uint32_t i = 0; i < size; ++i) {
        if (cells[i].has_seq_id(seq_id) && cells[i].pos >= p0 && cells[i].pos < p1) {
            has_shift = true;

            cells[i].pos   += delta;
            cells[i].delta += delta;

            if (cells[i].pos < 0) {
                if (!cells[i].is_empty()) {
                    used--;
                }
                cells[i].pos = -1;
                cells[i].seq_id.clear();
                if (new_head == size) {
                    new_head = i;
                }
            }
        }
    }

// after
    for (uint32_t i = 0; i < cells.size(); ++i) {
        if (!cells.pos_in(i, p0, p1)) {
            continue;
        }

        if (cells.seq_has(i, seq_id)) {
            if (cells.pos_add(i, delta)) {
                if (new_head == cells.size()) {
                    new_head = i;
                }
            }
        }
    }
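The bookkeeping that used to be inline (setting has_shift, decrementing used, clearing the cell) now lives inside the helpers. A hedged sketch of what pos_in and pos_add might do, assuming pos_add returns true exactly when the shift empties the cell (the sketch omits sequence clearing for brevity):

```cpp
#include <cstdint>
#include <vector>

using llama_pos = int32_t;

// hypothetical standalone sketch of the helpers used in the "after" loop;
// the real methods are members of llama_kv_cells_unified
struct cells_pos_sketch {
    std::vector<llama_pos> pos;   // -1 == empty
    std::vector<llama_pos> shift; // accumulated delta, consumed by the K-shift
    uint32_t used      = 0;
    bool     has_shift = false;

    // true if cell i holds a position in [p0, p1)
    bool pos_in(uint32_t i, llama_pos p0, llama_pos p1) const {
        return pos[i] >= p0 && pos[i] < p1;
    }

    // shift cell i by delta; returns true if the cell became empty,
    // updating 'used' and 'has_shift' internally
    bool pos_add(uint32_t i, llama_pos delta) {
        pos[i]   += delta;
        shift[i] += delta;
        has_shift = true;

        if (pos[i] < 0) {
            pos[i]   = -1;
            shift[i] = 0;
            used--;
            return true;
        }
        return false;
    }
};
```

With this split, the caller only decides *which* cells to touch; *how* a shift affects the cache state is decided in one place.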

Next

  • Add a heap for efficiently tracking min and max pos per sequence
  • Efficiently track the max occupied KV cell (i.e. n = cell_max()) instead of searching for it on every batch
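The first item could be sketched roughly as follows (a hypothetical illustration, not the planned implementation; the PR mentions a heap, and an ordered multiset is used here as a stand-in that gives the same O(log n) updates with O(1) min/max access):

```cpp
#include <cstdint>
#include <map>
#include <set>

using llama_pos    = int32_t;
using llama_seq_id = int32_t;

// hypothetical sketch of per-sequence min/max position tracking,
// avoiding a full scan of the cells on every query
struct seq_pos_sketch {
    std::map<llama_seq_id, std::multiset<llama_pos>> seq_pos;

    void add(llama_seq_id s, llama_pos p) { seq_pos[s].insert(p); }

    void rm(llama_seq_id s, llama_pos p) {
        auto it = seq_pos[s].find(p);
        if (it != seq_pos[s].end()) { seq_pos[s].erase(it); }
    }

    // smallest tracked position for sequence s, or -1 if none
    llama_pos pos_min(llama_seq_id s) const {
        auto it = seq_pos.find(s);
        return (it == seq_pos.end() || it->second.empty()) ? -1 : *it->second.begin();
    }

    // largest tracked position for sequence s, or -1 if none
    llama_pos pos_max(llama_seq_id s) const {
        auto it = seq_pos.find(s);
        return (it == seq_pos.end() || it->second.empty()) ? -1 : *it->second.rbegin();
    }
};
```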

ggerganov (Member, Author) commented May 22, 2025

In the next PR I will try to rework these 3 methods with something like batch_process_info_t init(const llama_batch & batch);:

// =============================================================================================================
// TODO: refactor and simplify this [TAG: KV_API]
virtual llama_sbatch sbatch_init(const llama_batch & batch, bool logits_all) = 0;
// different KV caches require different batch splitting strategies
virtual llama_ubatch ubatch_next(llama_sbatch & sbatch, uint32_t n_ubatch, bool embd_pooled) const = 0;
// find an empty slot of size "n_tokens" in the cache
virtual bool find_slot(const llama_ubatch & batch) = 0;

The main goal is to be able to run SWA caches with just n_swa + n_ubatch cells as explained in #13194 (comment). This refactor will also absorb the find_slot logic so that the llama_context won't need to be aware of this implementation detail. The batch -> sbatch -> ubatch logic will also be better contained within the llama_kv_cache and won't leak into llama_decode() as it does now, which will be useful later for implementing new KV caches.
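A rough sketch of what such an interface could look like (the name batch_process_info_t comes from the comment above; the fields, the dummy splitting strategy, and all other types here are assumptions for illustration only):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// stand-ins for the real llama.cpp types, for illustration only
struct llama_batch  { int32_t n_tokens = 0; };
struct llama_ubatch { int32_t n_tokens = 0; };

// hypothetical result of the proposed init(batch) call: the cache has
// already split the batch and reserved slots, so llama_decode() no
// longer needs to know about sbatch/ubatch splitting or find_slot
struct batch_process_info_t {
    bool success = false;               // slots found for all ubatches
    std::vector<llama_ubatch> ubatches; // split per the cache's own strategy
};

struct llama_kv_cache_sketch {
    virtual batch_process_info_t init(const llama_batch & batch) = 0;
    virtual ~llama_kv_cache_sketch() = default;
};

// toy implementation that just chunks the batch into fixed-size ubatches
struct dummy_cache : llama_kv_cache_sketch {
    int32_t n_ubatch = 32; // assumed micro-batch size

    batch_process_info_t init(const llama_batch & batch) override {
        batch_process_info_t res;
        for (int32_t i = 0; i < batch.n_tokens; i += n_ubatch) {
            llama_ubatch ub;
            ub.n_tokens = std::min(n_ubatch, batch.n_tokens - i);
            res.ubatches.push_back(ub);
        }
        res.success = true;
        return res;
    }
};
```

Under this shape, simulating a full cache (the set_full() case below) reduces to calling init() with the worst-case batch and discarding the result instead of processing it.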

When this rework is ready, I will use the new init(batch) mechanism to remove the set_full() call from the interface:

// simulate full cache, used for allocating worst-case compute buffers
// TODO: remove
virtual void set_full() = 0;

Simulating a full cache will then be achieved by initializing the appropriate batches and simply not processing them.

Any suggestions about the plan are welcome.

@ggerganov ggerganov requested a review from slaren May 22, 2025 13:26
@ggerganov ggerganov force-pushed the gg/kv-cache-simplify-part2 branch from 0a8cdc3 to eda2e13 on May 23, 2025 09:03
}

// note: call only if the cell is not empty
llama_pos get_pos(uint32_t i) const {
Review comment (Member):
Should this be pos_get for consistency with pos_set etc?
