
Enabling cache_prompt on completion request fills KV cache quickly  #4989

Closed

Description

@BruceMacD

llama.cpp version: 5c99960

When running the llama.cpp example server and sending requests with cache_prompt enabled, the model starts predicting continuously and fills the KV cache. How long this takes varies with the context size, but with the default context size (512) the KV cache can fill very quickly, within 3 requests.

Expected Behavior

Enabling prompt caching does not affect inference, and requests fail gracefully when the KV cache is full.
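
As a rough illustration of that expectation, here is a minimal client-side sketch of what a graceful failure could look like. This is not llama.cpp's actual error contract; the status code and error text are assumptions.

import requests

# Hypothetical sketch of the expected behavior: when the KV cache is full,
# the server would answer with an error status instead of hanging, and the
# client could report it. The error shape here is an assumption.
response = requests.post(
    "http://127.0.0.1:8080/completion",
    json={"prompt": "Why is the sky blue?", "cache_prompt": True},
)
if response.ok:
    print(response.json().get("content", ""))
else:
    # e.g. a 4xx/5xx status with a body explaining the KV cache is full (assumed)
    print(f"completion rejected: {response.status_code} {response.text}")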

Current Behavior

Enabling cache_prompt on requests to the example server's /completion endpoint fills the KV cache quite quickly, with continuous prediction before the failure.

Environment and Context

$ system_profiler SPSoftwareDataType SPHardwareDataType
Software:
    System Software Overview:
      System Version: macOS 14.1 (23B2073)
      Kernel Version: Darwin 23.1.0
Hardware:
    Hardware Overview:
      Model Name: MacBook Pro
      Model Identifier: Mac15,8
      Chip: Apple M3 Max
      Total Number of Cores: 16 (12 performance and 4 efficiency)
      Memory: 128 GB
      System Firmware Version: 10151.41.12
      OS Loader Version: 10151.41.12

Reproduction

  1. Start the llama.cpp example server with the default configuration.
./server -m mistral-7b-instruct-v0.2.Q4_0.gguf
  2. Run this Python script to send requests (a variant with a request timeout is sketched after the prediction output below):
import requests


def main():
    url = "http://127.0.0.1:8080/completion"
    data = {
        "prompt": "Why is the sky blue?",
        "cache_prompt": True,
    }

    for i in range(100):
        print(f"sending request {i=}")
        with requests.post(url, json=data) as response:  # Hangs about every 5 requests on my system
            if not response.ok:
                print(response)


if __name__ == "__main__":
    main()
  3. After a few requests the KV cache will be full.
    Here is the relevant logging:
print_timings: prompt eval time =      16.85 ms /     0 tokens (     inf ms per token,     0.00 tokens per second)
print_timings:        eval time =    1325.03 ms /    86 runs   (   15.41 ms per token,    64.90 tokens per second)
print_timings:       total time =    1341.87 ms
slot 0 released (93 tokens in cache)
{"timestamp":1705437767,"level":"INFO","function":"log_server_request","line":2812,"message":"request","remote_addr":"127.0.0.1","remote_port":54752,"status":200,"method":"POST","path":"/completion","params":{}}
slot 0 is processing [task id: 2]
slot 0 : in cache: 7 tokens | to process: 0 tokens
slot 0 : kv cache rm - [7, end)
slot 0 : we have to evaluate at least 1 token to generate logits
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 256
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 128
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 64
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 32
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 16
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 8
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 4
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 2
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 1
update_slots : failed to decode the batch, n_batch = 1, ret = 1

Here is the prediction output on the last request before it hangs, with whitespace omitted:

The sky appears blue because of a phenomenon called Rayleigh scattering, which occurs when sunlight enters Earth's atmosphere and encounters tiny molecules of gases such as nitrogen and oxygen. These molecules scatter the light in all directions, but they scatter shorter (blue) wavelengths more than longer (red) wavelengths. This is known as Rayleigh scattering.

*lots of white-space omitted here*

MSMSMSMSMS
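
To make the hang easier to observe, here is a variant of the reproduction script above with a request timeout (the 30-second value is an arbitrary assumption), so a stalled request raises an exception instead of blocking the loop:

import requests


def main():
    url = "http://127.0.0.1:8080/completion"
    data = {
        "prompt": "Why is the sky blue?",
        "cache_prompt": True,
    }

    for i in range(100):
        print(f"sending request {i=}")
        try:
            # The timeout value is an arbitrary assumption; it only serves to
            # surface the hang as an exception instead of blocking forever.
            response = requests.post(url, json=data, timeout=30)
        except requests.Timeout:
            print(f"request {i} timed out (server likely stuck with a full KV cache)")
            break
        if not response.ok:
            print(response)


if __name__ == "__main__":
    main()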

Potentially related: #4185
