Description
In ./server, Continuation mode cannot be used correctly with Llama 3 70B, because the correct prompt template cannot be entered. The token <|eot_id|> is tokenized to zero tokens, even when it occurs midway through the prompt:
(In the screenshot above, I hit Start and looked at the number of tokens cached minus the number of tokens predicted: 402 - 400 = 2. That difference is the token count of the prompt I typed. It shows 2 where it should be 3, since <|eot_id|> should count as one token. I deleted the generated tokens before taking the screenshot, so only my original input is visible.)
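The same miscount can be checked without the UI by asking the server to tokenize a string directly. Below is a minimal sketch using the server's /tokenize endpoint, assuming a ./server instance on localhost:8080; the exact request fields and special-token handling may differ between builds:

```python
# Minimal sketch: count tokens returned by the server's /tokenize endpoint.
# Assumes ./server is listening on localhost:8080 (an assumption, adjust as needed).
import json
import urllib.request

def count_tokens(text: str) -> int:
    req = urllib.request.Request(
        "http://localhost:8080/tokenize",
        data=json.dumps({"content": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return len(json.loads(resp.read())["tokens"])

base = count_tokens("Hello there")
with_eot = count_tokens("Hello there<|eot_id|>")
# Expected: with_eot == base + 1 (the special token should count as one token).
# Observed: with_eot == base, i.e. <|eot_id|> is tokenized to zero tokens.
print(base, with_eot)
```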
The <|eot_id|> token is required multiple times by the prompt template, which looks like this:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
[system prompt goes here]<|eot_id|><|start_header_id|>user<|end_header_id|>
[user prompt goes here]<|eot_id|><|start_header_id|>assistant<|end_header_id|>
[ai response will go here]
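For reference, a Continuation-mode prompt following this template could be assembled as in the sketch below; the variable names are illustrative only and not part of ./server:

```python
# Sketch of a Llama 3 Instruct prompt built from the template above.
# system_prompt and user_prompt are placeholder names, not ./server options.
system_prompt = "You are a helpful assistant."
user_prompt = "Why is the sky blue?"

prompt = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    f"{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
    f"{user_prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)
# Each <|eot_id|> must survive tokenization as a single special token
# for the model to see the prompt in the format it was trained on.
print(prompt)
```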
Not adhering to the prompt template usually degrades the quality of the LLM's output.