Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
I think it would be useful if llama.cpp supported a vocabulary type that has no real tokens and operates directly on raw bytes. Something like LLAMA_VOCAB_TYPE_RAW_BYTES could be added to enum llama_vocab_type, though I'm not sure what changes that would imply elsewhere. Such a vocabulary would of course still need special tokens. A rough sketch of the idea is shown below.
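To illustrate what I mean, here is a minimal standalone sketch, not actual llama.cpp code: the enum values and helper functions (tokenize_raw_bytes, detokenize_raw_bytes, the bos_id value) are hypothetical and only mirror how such a vocabulary type might behave, with token ids 0..255 mapping directly to byte values and special tokens living above that range.

```cpp
// Hypothetical sketch of a raw-byte vocabulary -- not the actual llama.cpp API.
#include <cstdint>
#include <string>
#include <vector>

// A new entry could be appended to the existing enum in llama.h
// (the other values here are illustrative, not copied from the header):
enum llama_vocab_type_sketch {
    LLAMA_VOCAB_TYPE_NONE      = 0,
    LLAMA_VOCAB_TYPE_SPM       = 1,
    LLAMA_VOCAB_TYPE_BPE       = 2,
    LLAMA_VOCAB_TYPE_RAW_BYTES = 3, // proposed: token id == byte value
};

// Tokenization becomes a trivial byte copy; special tokens (BOS, EOS, ...)
// would get ids >= 256 so they cannot collide with byte values.
static std::vector<int32_t> tokenize_raw_bytes(const std::string & text,
                                               int32_t bos_id, bool add_bos) {
    std::vector<int32_t> out;
    out.reserve(text.size() + 1);
    if (add_bos) {
        out.push_back(bos_id); // e.g. 256 in this sketch
    }
    for (unsigned char c : text) {
        out.push_back((int32_t) c);
    }
    return out;
}

// Detokenization is the inverse: ids below 256 map back to bytes,
// anything else is treated as a special token and skipped.
static std::string detokenize_raw_bytes(const std::vector<int32_t> & tokens) {
    std::string out;
    for (int32_t id : tokens) {
        if (id >= 0 && id < 256) {
            out.push_back((char) id);
        }
    }
    return out;
}
```

The appeal is that tokenize/detokenize become exact inverses on arbitrary byte sequences, with no merges, no unknown-token handling, and a fixed vocabulary size of 256 plus the special tokens.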
Motivation
There is already some interesting research on making token-free LLMs work:
- ByT5: Towards a token-free future with pre-trained byte-to-byte models
- MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers
- Bytes Are All You Need: Transformers Operating Directly On File Bytes
- MambaByte: Token-free Selective State Space Model
And I think this is going to become even more relevant in the future. To quote Andrej Karpathy: "I would love nothing more than to be able to feed raw byte sequences into language models".
Possible Implementation
No response