
Enabling cache_prompt on completion request fills KV cache quickly  #4989

Closed

Description

@BruceMacD

llama.cpp version: 5c99960

When running the llama.cpp example server and sending requests with cache_prompt enabled, the model starts predicting continuously and fills the KV cache. How long this takes varies with the context size, but with the default context size (512) the KV cache can fill very quickly, within 3 requests.

Expected Behavior

Enabling prompt caching does not affect inference, and requests fail gracefully when the KV cache is full.
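
As a rough illustration of that expectation, here is a minimal client-side sketch of what a graceful failure could look like. This is not llama.cpp's actual error contract; the status code and error text are assumptions.

import requests

# Hypothetical sketch of the expected behavior: when the KV cache is full,
# the server would answer with an error status instead of hanging, and the
# client could report it. The error shape here is an assumption.
response = requests.post(
    "http://127.0.0.1:8080/completion",
    json={"prompt": "Why is the sky blue?", "cache_prompt": True},
)
if response.ok:
    print(response.json().get("content", ""))
else:
    # e.g. a 4xx/5xx status with a body explaining the KV cache is full (assumed)
    print(f"completion rejected: {response.status_code} {response.text}")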

Current Behavior

Enabling cache_prompt on requests to the example server's /completion endpoint fills the KV cache quite quickly, with continuous prediction before the failure.

Environment and Context

$ system_profiler SPSoftwareDataType SPHardwareDataType
Software:
    System Software Overview:
      System Version: macOS 14.1 (23B2073)
      Kernel Version: Darwin 23.1.0
Hardware:
    Hardware Overview:
      Model Name: MacBook Pro
      Model Identifier: Mac15,8
      Chip: Apple M3 Max
      Total Number of Cores: 16 (12 performance and 4 efficiency)
      Memory: 128 GB
      System Firmware Version: 10151.41.12
      OS Loader Version: 10151.41.12

Reproduction

  1. Start the llama.cpp example server with the default configuration.
./server -m mistral-7b-instruct-v0.2.Q4_0.gguf
  2. Run this Python script to send requests (a variant with a request timeout is sketched after the prediction output below):
import requests


def main():
    url = "http://127.0.0.1:8080/completion"
    data = {
        "prompt": "Why is the sky blue?",
        "cache_prompt": True,
    }

    for i in range(100):
        print(f"sending request {i=}")
        with requests.post(url, json=data) as response:  # Hangs about every 5 requests on my system
            if not response.ok:
                print(response)


if __name__ == "__main__":
    main()
  3. After a few requests the KV cache will be full.
    Here is the relevant logging:
print_timings: prompt eval time =      16.85 ms /     0 tokens (     inf ms per token,     0.00 tokens per second)
print_timings:        eval time =    1325.03 ms /    86 runs   (   15.41 ms per token,    64.90 tokens per second)
print_timings:       total time =    1341.87 ms
slot 0 released (93 tokens in cache)
{"timestamp":1705437767,"level":"INFO","function":"log_server_request","line":2812,"message":"request","remote_addr":"127.0.0.1","remote_port":54752,"status":200,"method":"POST","path":"/completion","params":{}}
slot 0 is processing [task id: 2]
slot 0 : in cache: 7 tokens | to process: 0 tokens
slot 0 : kv cache rm - [7, end)
slot 0 : we have to evaluate at least 1 token to generate logits
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 256
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 128
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 64
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 32
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 16
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 8
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 4
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 2
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 1
update_slots : failed to decode the batch, n_batch = 1, ret = 1

Here is the prediction output on the last request before it hangs, with whitespace omitted:

The sky appears blue because of a phenomenon called Rayleigh scattering, which occurs when sunlight enters Earth's atmosphere and encounters tiny molecules of gases such as nitrogen and oxygen. These molecules scatter the light in all directions, but they scatter shorter (blue) wavelengths more than longer (red) wavelengths. This is known as Rayleigh scattering.

*lots of white-space omitted here*

MSMSMSMSMS
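
To make the hang easier to observe, here is a variant of the reproduction script above with a request timeout (the 30-second value is an arbitrary assumption), so a stalled request raises an exception instead of blocking the loop:

import requests


def main():
    url = "http://127.0.0.1:8080/completion"
    data = {
        "prompt": "Why is the sky blue?",
        "cache_prompt": True,
    }

    for i in range(100):
        print(f"sending request {i=}")
        try:
            # The timeout value is an arbitrary assumption; it only serves to
            # surface the hang as an exception instead of blocking forever.
            response = requests.post(url, json=data, timeout=30)
        except requests.Timeout:
            print(f"request {i} timed out (server likely stuck with a full KV cache)")
            break
        if not response.ok:
            print(response)


if __name__ == "__main__":
    main()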

Potentially related: #4185
