Use the HuggingFace llama Tokenizer

The [`tokenizers`](https://github.com/huggingface/tokenizers) crate by HuggingFace should give us a more correct tokenizer implementation than the one we're currently using.

Looks like a LLaMA implementation already landed there https://github.com/huggingface/transformers/pull/21955, and then @Narsil shared an additional PR on the tokenizers crate (not sure what this fixes, but I assume the changes are necessary?) https://github.com/huggingface/tokenizers/pull/1183

Seems like we have everything we need to use the new tokenizer. An important point remains though: Are we allowed to distribute the tokenizer file? Can it be considered a completely independent thing from the weights?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Use the HuggingFace llama Tokenizer #35

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Use the HuggingFace llama Tokenizer #35

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions