Fix logprobs when multiple tokens are returned at once. #141
This fixes a few issues with logprobs:
Here's an example of the current output. To reproduce this more easily, I set "Helloxxx" as a stop string, which causes "Hello" + " !" to be returned together by exllamav2:
Note that "tokens" is "Hi", even though the actual text is "Hello!", and the logprobs for the two are lumped together. With this update:
On the chat completion side, with a similar output where "Hello" + "!" are returned together:
The tokens are mismatched: the "!" token is missing and the top_logprobs are off by one. This now returns:
A couple things that still need to be figured out:
I'm not sure whether text_offset is supposed to be the offset into the returned text string (this is close to what it was doing before, so I went with that for now) or the offset into the full context. I can't find OAI docs on this, but from some API snippets I've seen it might be the latter. (It's simple to derive from the other data, so maybe nobody's actually using this field right now.)
Results are odd when token healing is enabled, since the regenerated initial token is included in the list. For example, if the context was "https://", and token healing backs up by three characters and generates "://www", it currently returns that whole underlying token (and a text_offset of -3, since the token starts three characters before the start of the output). But from the client's perspective all that the model actually generated was "www". The token healing overlap should probably be trimmed off from the output, so concatenating the "token" in each entry always gives the same result as "text". I'll return to this after discussion.
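To make the two open questions above concrete, here is a rough sketch, not the code in this PR. The helper names `text_offsets` and `trim_healing_overlap`, the `base` parameter, and the ".example" token are all made up for illustration. It assumes each token is available as a plain string and that the number of characters token healing backed up by is known.

```python
from typing import List


def text_offsets(tokens: List[str], base: int = 0) -> List[int]:
    """Return the character offset at which each token starts.

    base=0 gives offsets into the returned text; passing the context
    length as `base` gives offsets into context + text instead.
    """
    offsets = []
    position = base
    for token in tokens:
        offsets.append(position)
        position += len(token)
    return offsets


def trim_healing_overlap(tokens: List[str], overlap: int) -> List[str]:
    """Drop the first `overlap` characters from the token list.

    Tokens that lie entirely inside the healed region are removed;
    a token that straddles the boundary keeps only its new part.
    """
    trimmed = []
    remaining = overlap
    for token in tokens:
        if remaining > 0 and remaining >= len(token):
            # Whole token was regenerated context, not new output.
            remaining -= len(token)
            continue
        trimmed.append(token[remaining:])
        remaining = 0
    return trimmed


# Context "https://", token healing backs up 3 characters.
tokens = trim_healing_overlap(["://www", ".example"], overlap=3)
print(tokens)                         # ['www', '.example']
print(text_offsets(tokens))           # [0, 3]
print(text_offsets(tokens, base=8))   # [8, 11]
```

With the overlap trimmed, joining the tokens gives the same string as "text" and no offset is negative. What this sketch does not settle is which logprob to report for the trimmed first token, since the model's logprob applies to the whole underlying token ("://www"), not just the visible part.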