Bug: KV cache load/save is slow #8915

Closed
@josharian

Description

What happened?

I wrote a cache of KV caches, and then benchmarked it.

llama_state_seq_get_size, llama_state_seq_get_data, and llama_state_seq_set_data are slow enough that it is significantly (13x) faster to just start over from nothing each time.

However, from looking through the code, I think there is room for substantial improvement. (It is unclear to me whether these improvements would be enough to make an external cache worth managing, but in theory I think it ought to be possible.)
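
For context, the save/restore pattern I'm benchmarking looks roughly like the following sketch. The exact parameter lists of these functions have shifted between llama.cpp versions, so treat the calls as illustrative; ctx and seq_id stand for an existing context and sequence.

#include "llama.h"
#include <vector>

// External-cache pattern: size, serialize, stash, restore.
static std::vector<uint8_t> save_seq(llama_context * ctx, llama_seq_id seq_id) {
    std::vector<uint8_t> buf(llama_state_seq_get_size(ctx, seq_id)); // size needed up front
    llama_state_seq_get_data(ctx, buf.data(), seq_id);               // serialize sequence state
    return buf;                                                      // stash in the external cache
}

static void restore_seq(llama_context * ctx, llama_seq_id seq_id,
                        const std::vector<uint8_t> & buf) {
    llama_state_seq_set_data(ctx, buf.data(), seq_id);               // load it back
}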

Here are a few observations, starting with just the get APIs...


llama_state_seq_get_size does a full copy from the GPU and throws it away. (My cache management implementation is in Go, so for GC/allocator reasons, I need the size up front.)

size_t llama_state_seq_get_size(struct llama_context *ctx,
                                llama_seq_id seq_id) {
  // Runs the full serialization path with a dummy writer: the data is
  // still copied off the backend, then discarded.
  llama_data_write_dummy data_ctx;
  return llama_state_seq_get_data_internal(ctx, data_ctx, seq_id);
}
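
In principle, that size is fully determined by metadata, so it could be computed without touching the backend at all. Here is a minimal sketch of the K-tensor contribution for one layer, reusing the kv_self, k_size_row, and cell_ranges names from the surrounding code (the exact loop structure here is an assumption):

// Hypothetical size-only accounting: the byte count depends only on the
// saved cell ranges and the row size, so no ggml_backend_tensor_get is needed.
size_t k_bytes = 0;
for (const auto & range : cell_ranges) {
    const size_t range_size = range.second - range.first;
    k_bytes += range_size * k_size_row; // same bytes the dummy writer currently copies
}
// ... plus the V tensors, headers, and per-cell metadata, computed the same way.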

In write_kv_cache_data, there is a lot of double-copying: from the GPU to a staging buffer, and then from the staging buffer to the destination. For example:

        // GPU -> staging buffer ...
        tmp_buf.resize(range_size * k_size_row);
        ggml_backend_tensor_get(kv_self.k_l[il], tmp_buf.data(),
                                range.first * k_size_row,
                                range_size * k_size_row);
        // ... then staging buffer -> destination.
        write(tmp_buf.data(), tmp_buf.size());

An extremely crude benchmark suggests that this double-copy is ~5% of the runtime of llama_state_seq_get_data.
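
One way to eliminate the staging copy would be a data context that hands out destination pointers, so that ggml_backend_tensor_get can write straight into the caller's buffer. A hypothetical sketch (reserve is not an existing llama.cpp interface):

#include "ggml.h" // for GGML_ASSERT

// Hypothetical direct-write data context: instead of accepting a copy,
// it reserves space in the final buffer for the backend to fill.
struct llama_data_write_direct {
    uint8_t * ptr;       // next free byte in the caller's buffer
    size_t    remaining; // bytes left

    uint8_t * reserve(size_t size) {
        GGML_ASSERT(size <= remaining);
        uint8_t * out = ptr;
        ptr       += size;
        remaining -= size;
        return out;
    }
};

// The snippet above would then collapse to a single device-to-destination hop:
//   ggml_backend_tensor_get(kv_self.k_l[il],
//                           data_ctx.reserve(range_size * k_size_row),
//                           range.first * k_size_row,
//                           range_size * k_size_row);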


We call ggml_backend_tensor_get many times. When the data being read is contiguous, it would probably be significantly faster to do a single transfer. A back-of-the-envelope calculation of PCIe transfer rates suggests that we are nowhere near saturating the bus, and there is very little computation going on, which points to per-transfer latency overhead as a major culprit.
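
To illustrate, when the saved cell ranges cover one contiguous span of the K tensor, the per-range reads could collapse into a single call. This sketch assumes cell_ranges is sorted and gap-free, and reuses the hypothetical reserve() interface from above:

// Coalesce contiguous ranges into one transfer to amortize per-call latency.
const size_t first_cell = cell_ranges.front().first;
const size_t last_cell  = cell_ranges.back().second; // exclusive
const size_t nbytes     = (last_cell - first_cell) * k_size_row;
ggml_backend_tensor_get(kv_self.k_l[il],
                        data_ctx.reserve(nbytes), // hypothetical direct-write API
                        first_cell * k_size_row,
                        nbytes);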


I'm using an RTX 4090 with a server-grade motherboard.

cc @abetlen
cc @slaren (per suggestion of @abetlen)

Name and Version

$ ./llama-cli --version
version: 3488 (75af08c)
built with cc (Ubuntu 13.2.0-23ubuntu4) 13.2.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

No response
