Description
llama.cpp version: 5c99960
When running the llama.cpp example server and sending requests with cache_prompt enabled, the model starts predicting continuously and fills the KV cache. How long this takes varies with context size, but with the default context size (512) the KV cache can run out within about 3 requests.
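As a rough illustration of the context-size dependence (a sketch only, assuming the server's -c context-size flag; this delays the failure rather than fixing it), starting the server with a larger context makes the cache take longer to fill:
./server -m mistral-7b-instruct-v0.2.Q4_0.gguf -c 4096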
Expected Behavior
Enabling prompt caching does not affect inference, and requests fail gracefully when the KV cache is full.
Current Behavior
Enabling cache_prompt on requests to the example server's /completion endpoint fills the KV cache quickly, with continuous prediction before the failure.
Environment and Context
$ system_profiler SPSoftwareDataType SPHardwareDataType
Software:
System Software Overview:
System Version: macOS 14.1 (23B2073)
Kernel Version: Darwin 23.1.0
Hardware:
Hardware Overview:
Model Name: MacBook Pro
Model Identifier: Mac15,8
Chip: Apple M3 Max
Total Number of Cores: 16 (12 performance and 4 efficiency)
Memory: 128 GB
System Firmware Version: 10151.41.12
OS Loader Version: 10151.41.12
Reproduction
- Start the llama.cpp example server with default configuration.
./server -m mistral-7b-instruct-v0.2.Q4_0.gguf
- Run this Python script to send requests:
import requests

def main():
    url = "http://127.0.0.1:8080/completion"
    data = {
        "prompt": "Why is the sky blue?",
        "cache_prompt": True,
    }
    for i in range(100):
        print(f"sending request {i=}")
        with requests.post(url, json=data) as response:  # Hangs about every 5 requests on my system
            if not response.ok:
                print(response)

if __name__ == "__main__":
    main()
- After a few requests the KV cache will be full.
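For a quick manual check without Python, a roughly equivalent single request to the same endpoint can be sent with curl (assuming the default host and port used above):
curl -s http://127.0.0.1:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Why is the sky blue?", "cache_prompt": true}'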
Here is the relevant logging:
print_timings: prompt eval time = 16.85 ms / 0 tokens ( inf ms per token, 0.00 tokens per second)
print_timings: eval time = 1325.03 ms / 86 runs ( 15.41 ms per token, 64.90 tokens per second)
print_timings: total time = 1341.87 ms
slot 0 released (93 tokens in cache)
{"timestamp":1705437767,"level":"INFO","function":"log_server_request","line":2812,"message":"request","remote_addr":"127.0.0.1","remote_port":54752,"status":200,"method":"POST","path":"/completion","params":{}}
slot 0 is processing [task id: 2]
slot 0 : in cache: 7 tokens | to process: 0 tokens
slot 0 : kv cache rm - [7, end)
slot 0 : we have to evaluate at least 1 token to generate logits
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 256
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 128
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 64
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 32
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 16
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 8
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 4
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 2
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 1
update_slots : failed to decode the batch, n_batch = 1, ret = 1
Here is the prediction output on the last request before it hangs, with whitespace omitted:
The sky appears blue because of a phenomenon called Rayleigh scattering, which occurs when sunlight enters Earth's atmosphere and encounters tiny molecules of gases such as nitrogen and oxygen. These molecules scatter the light in all directions, but they scatter shorter (blue) wavelengths more than longer (red) wavelengths. This is known as Rayleigh scattering.
*lots of white-space omitted here*
MSMSMSMSMS
Potentially related: #4185