server : add "token healing" support #5765

# Prerequisites

Please answer the following questions for yourself before submitting an issue.

- [X] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- [X] I carefully followed the [README.md](https://github.com/ggerganov/llama.cpp/blob/master/README.md).
- [X] I [searched using keywords relevant to my issue](https://docs.github.com/en/issues/tracking-your-work-with-issues/filtering-and-searching-issues-and-pull-requests) to make sure that I am creating a new issue that is not already open (or closed).
- [X] I reviewed the [Discussions](https://github.com/ggerganov/llama.cpp/discussions), and have a new bug or useful enhancement to share.

# Feature Description

Hi! I am experimenting with using llama.cpp as a general-purpose code completion backend, similar to TabNine.

I am encountering a small problem: if the completion prompt ends mid-word, the results are not very accurate. For example, for a prompt such as `Five, Four, Thre` [sic], the model will often ignore the typo and suggest `, Two` (forming `Thre, Two`).

I think, as an option to the `/completion` server API, the following optional behavior would be useful:

1. Tokenize the text
2. Chop off the last token
3. Run the prediction on the remaining tokens, but only consider candidate tokens whose bytes start with the bytes of the removed token.
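
To make the three steps concrete, here is a minimal sketch in Python. It does not use any llama.cpp API: the vocabulary `VOCAB`, the greedy `tokenize` helper, and the `fake_scores` function standing in for the model are all hypothetical and exist only for illustration. A real implementation would presumably mask the logits and then sample, rather than take the argmax as done here.

```python
from typing import Callable, List

# Hypothetical toy vocabulary; real vocabularies are far larger.
VOCAB: List[bytes] = [b"Five", b",", b" Four", b" Th", b"re", b"ree", b" Two"]


def tokenize(text: bytes, vocab: List[bytes]) -> List[int]:
    """Greedy longest-match tokenizer (an assumption, not the real tokenizer)."""
    tokens: List[int] = []
    pos = 0
    while pos < len(text):
        matches = [i for i, tok in enumerate(vocab) if text.startswith(tok, pos)]
        if not matches:
            raise ValueError(f"cannot tokenize at byte offset {pos}")
        best = max(matches, key=lambda i: len(vocab[i]))
        tokens.append(best)
        pos += len(vocab[best])
    return tokens


def fake_scores(context: List[int]) -> List[float]:
    """Stand-in for the model: one fixed score per vocabulary entry."""
    # Unconstrained, the best-scoring token is b"," (index 1).
    return [0.0, 2.0, 0.1, 0.2, 0.5, 1.5, 1.0]


def heal(prompt: bytes, vocab: List[bytes],
         score_fn: Callable[[List[int]], List[float]]) -> bytes:
    tokens = tokenize(prompt, vocab)        # step 1: tokenize the text
    if not tokens:
        return prompt
    *kept, last = tokens                    # step 2: chop off the last token
    prefix = vocab[last]
    # Step 3: only tokens whose bytes start with the removed token's bytes.
    allowed = [i for i, tok in enumerate(vocab) if tok.startswith(prefix)]
    scores = score_fn(kept)
    best = max(allowed, key=lambda i: scores[i])
    return b"".join(vocab[i] for i in kept) + vocab[best]


print(heal(b"Five, Four, Thre", VOCAB, fake_scores))  # b'Five, Four, Three'
```

In this toy setup the prompt tokenizes as `Five`, `,`, ` Four`, `,`, ` Th`, `re`. Removing `re` and restricting the candidates to `re` and `ree` lets the stand-in model complete the word instead of emitting `,`.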

Thanks!
