server: fix system_tokens being erased in kv_cache; #6312
Hi llama.cpp developers :)
I have been reading the code these days, and I think there is a chance that system tokens may get erased from the kv_cache in the server example by this line:
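The original permalink is not reproduced here, so below is a simplified sketch of the context-shift code in examples/server/server.cpp as I understand it. It is paraphrased from memory rather than copied from the exact revision, so treat the variable names and bounds as illustrative:

```cpp
// Simplified context shift from examples/server/server.cpp (paraphrased):
const int n_keep    = slot.params.n_keep + add_bos_token;
const int n_left    = (int) system_tokens.size() + slot.n_past - n_keep;
const int n_discard = n_left / 2;

// kv_cache layout for this slot: [ system_tokens | prompt/generated tokens ]
// The removal range starts at n_keep measured from position 0 of the cache,
// so when system_tokens is non-empty it cuts into the system prompt region:
llama_kv_cache_seq_rm (ctx, slot.id + 1, n_keep, n_keep + n_discard);
llama_kv_cache_seq_add(ctx, slot.id + 1, n_keep + n_discard,
                       system_tokens.size() + slot.n_past, -n_discard);
```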
From my limited knowledge of the server code (I may be wrong, there is a lot of code to read, sorry), my understanding is: in the kv_cache, system_tokens occupy positions 0 up to their length, and only after system_tokens come the n_keep prompt_tokens. So if the code above removes the tokens between n_keep and n_keep + n_discard, it removes some of the system_tokens, which makes the generation stop working or produce something meaningless (a small numeric sanity check is sketched after the next paragraph).

Below is my test. This problem can only be reproduced with specific token counts; I just ran into it in my daily tests, which is why the original test content was in Chinese. Sorry again ;p
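To make the overlap concrete, here is a self-contained toy model using plain ints as "tokens". This is only my mental model of the bug and the fix; the numbers are made up and nothing here calls the real llama.cpp API:

```cpp
#include <cstdio>
#include <vector>

int main() {
    const int n_system  = 4;  // length of system_tokens (assumed)
    const int n_keep    = 2;  // prompt tokens that should always be kept
    const int n_discard = 3;  // tokens dropped on a context shift

    // cache layout: [ system | prompt/generated ], positions 0..11
    std::vector<int> cache;
    for (int i = 0; i < 12; ++i) cache.push_back(i);

    // old behaviour: erase [n_keep, n_keep + n_discard) = [2, 5),
    // which overlaps the system region [0, n_system) = [0, 4):
    std::vector<int> old_cache = cache;
    old_cache.erase(old_cache.begin() + n_keep,
                    old_cache.begin() + n_keep + n_discard);

    // fixed behaviour: offset the range past the system tokens,
    // erasing [n_system + n_keep, n_system + n_keep + n_discard) = [6, 9):
    std::vector<int> new_cache = cache;
    new_cache.erase(new_cache.begin() + n_system + n_keep,
                    new_cache.begin() + n_system + n_keep + n_discard);

    printf("old: "); for (int t : old_cache) printf("%d ", t); printf("\n");
    // old: 0 1 5 6 7 8 9 10 11  -> system positions 2 and 3 are gone
    printf("new: "); for (int t : new_cache) printf("%d ", t); printf("\n");
    // new: 0 1 2 3 4 5 9 10 11  -> system region [0, 4) stays intact
}
```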
The system prompt makes the assistant summarize some text. I wrote it in a translater.json file and used the -spf parameter to load it:

```json
{
  "prompt": "Assistant's name is John. # CONTEXT # I need you to summarize a passage of text. # OBJECTIVE # Read the text the user sends you and summarize its content; for the user's convenience, always summarize in the same language as the original text. # STYLE # No particular style is needed. # TONE # Summarizing. # AUDIENCE # Anyone who wants to know the gist of a passage of text. # RESPONSE # The answer should be clear, easy to understand, and concise. Use the same language as the user's input.",
  "anti_prompt": "User",
  "assistant_name": "Assistant"
}
```

Then I start the server with these parameters; notice that -c is commented out, so its value is the default 512. You can also see that I am using a Qwen model with an RTX 4090 card. With the server running, I call curl with five questions:
And these are the generations:

As you can see, the first time I ask about its system prompt and name, it answers correctly. But after I give it a long text to summarize, it forgets its name and system prompt (in this picture I only ask about the name).

This is the generation after applying this PR:



Now you can see it remembers who it is and what it should do!
Summary
I made this change just because I found it works a little better than before. I don't really understand the logic of the two parts of the token shift in the server example; I tried hard to read through them, but they are still not really clear to me. If there were some documentation on the kv_cache and these two parts of the shift code in the server example, that would be great. Thanks a lot :)
If this change is wrong, feel free to close the PR.
Below is the exact text I sent to the server:
I know the new version of the server would say "context is too long for kv_cache, ...", so you have to use the exact same text to reproduce this issue.
Thanks in advance :)