Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
I think it would be useful if llama.cpp supported a vocabulary type that has no real tokens and operates directly on raw bytes. Something like LLAMA_VOCAB_TYPE_RAW_BYTES could be added to enum llama_vocab_type, though I'm not sure what changes that would imply elsewhere. Such a vocabulary would of course still need special tokens. A rough sketch of the idea is shown below.
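To illustrate what I mean, here is a minimal standalone sketch, not actual llama.cpp code: the enum values and helper functions (tokenize_raw_bytes, detokenize_raw_bytes, the bos_id value) are hypothetical and only mirror how such a vocabulary type might behave, with token ids 0..255 mapping directly to byte values and special tokens living above that range.

```cpp
// Hypothetical sketch of a raw-byte vocabulary -- not the actual llama.cpp API.
#include <cstdint>
#include <string>
#include <vector>

// A new entry could be appended to the existing enum in llama.h
// (the other values here are illustrative, not copied from the header):
enum llama_vocab_type_sketch {
    LLAMA_VOCAB_TYPE_NONE      = 0,
    LLAMA_VOCAB_TYPE_SPM       = 1,
    LLAMA_VOCAB_TYPE_BPE       = 2,
    LLAMA_VOCAB_TYPE_RAW_BYTES = 3, // proposed: token id == byte value
};

// Tokenization becomes a trivial byte copy; special tokens (BOS, EOS, ...)
// would get ids >= 256 so they cannot collide with byte values.
static std::vector<int32_t> tokenize_raw_bytes(const std::string & text,
                                               int32_t bos_id, bool add_bos) {
    std::vector<int32_t> out;
    out.reserve(text.size() + 1);
    if (add_bos) {
        out.push_back(bos_id); // e.g. 256 in this sketch
    }
    for (unsigned char c : text) {
        out.push_back((int32_t) c);
    }
    return out;
}

// Detokenization is the inverse: ids below 256 map back to bytes,
// anything else is treated as a special token and skipped.
static std::string detokenize_raw_bytes(const std::vector<int32_t> & tokens) {
    std::string out;
    for (int32_t id : tokens) {
        if (id >= 0 && id < 256) {
            out.push_back((char) id);
        }
    }
    return out;
}
```

The appeal is that tokenize/detokenize become exact inverses on arbitrary byte sequences, with no merges, no unknown-token handling, and a fixed vocabulary size of 256 plus the special tokens.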
Motivation
There is already some interesting research on making token-free LLMs work:
- ByT5: Towards a token-free future with pre-trained byte-to-byte models
- MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers
- Bytes Are All You Need: Transformers Operating Directly On File Bytes
- MambaByte: Token-free Selective State Space Model
And I think this is going to become even more relevant in the future. To quote Andrej Karpathy: "I would love nothing more than to be able to feed raw byte sequences into language models".
Possible Implementation
No response