Workaround for recalculating logits in cached prompts (Fixes #1585) #1609
This code just checks if we've used up all of the given prompt but still have leftover tokens from the cache. In this case it instead leaves the last token in `embd` so it will be evaluated.

Evaluating one token isn't the end of the world, but it would be nice if it were faster. What I really wanted to do was something like this:
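Something along these lines, using the `llama_eval` signature of the time (a sketch; the exact arguments are an assumption, not the original snippet):

```cpp
// Ask llama_eval to process zero new tokens at position n_past, purely to
// recompute the logits for the current position. (Assumed call shape.)
llama_eval(ctx, embd.data(), 0, n_past, params.n_threads);
```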
But you can't eval 0 tokens.
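Hence the workaround settles for re-evaluating one token. A minimal sketch of what the check amounts to, with assumed variable names rather than the exact diff:

```cpp
// embd holds the pending prompt tokens; n_matched of them were found in the
// session cache. If the entire remainder of the prompt is cached, nothing
// would be left to eval and no fresh logits would be produced, so keep the
// last token in embd and let it be re-evaluated. (Sketch; names assumed.)
if (n_matched > 0 && n_matched == embd.size()) {
    --n_matched;
    --n_past;
}
embd.erase(embd.begin(), embd.begin() + n_matched);
```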
I tried to edit `llama_eval_internal` to skip over the input calculations when `n_tokens` (`N`) is 0, but it's actually way more integrated than I expected. Perhaps there was something to #1281 after all. I think it would be nicer if `llama_eval` just supported 0 tokens, but it would also be nice to have a different API call that would evaluate the logits at a given position `n_past`.

This API call could also be used for implementing the mockup in #1608, where you can click on a token and see alternatives.
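For illustration, such a call might look like this; the name and signature are hypothetical, not an existing llama.cpp API:

```cpp
// Hypothetical -- not an existing llama.cpp function. Recompute the logits
// for position i_pos, which is already in the KV cache, without consuming
// any new tokens or advancing the context.
int llama_eval_logits_at(struct llama_context * ctx, int i_pos, int n_threads);

// Usage sketch for the #1608 mockup: click the token at position `pos` and
// read back what the model would have proposed there.
//   llama_eval_logits_at(ctx, pos, n_threads);
//   const float * logits = llama_get_logits(ctx);
```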