Closed
Description
If we can support the standard subword tokenizers used by modern transformer models (byte-pair encoding, SentencePiece, etc.), we'll have much better compatibility with current NLP models. (The tokenizers can later be reused for any kind of small or large word2vec/LSTM/transformer model.)
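To make the scope concrete, here is a minimal sketch of the core BPE encoding loop: given a pretrained merge table ranked by priority, repeatedly apply the highest-priority merge to adjacent symbols until none applies. This is an illustration only; the toy merge table below is hypothetical, and real GPT-style tokenizers add byte-level pre-tokenization and a token-to-id vocabulary on top of this step.

```typescript
// "a b" -> merge rank; lower rank = learned earlier = higher priority
type MergeRank = Map<string, number>;

function bpe(word: string[], ranks: MergeRank): string[] {
  const symbols = [...word];
  while (symbols.length > 1) {
    // find the adjacent pair with the best (lowest) merge rank
    let best = -1;
    let bestRank = Infinity;
    for (let i = 0; i < symbols.length - 1; i++) {
      const r = ranks.get(symbols[i] + " " + symbols[i + 1]);
      if (r !== undefined && r < bestRank) {
        bestRank = r;
        best = i;
      }
    }
    if (best < 0) break; // no known merge applies; done
    symbols.splice(best, 2, symbols[best] + symbols[best + 1]);
  }
  return symbols;
}

// Hypothetical toy merge table (not from a real pretrained model):
const ranks: MergeRank = new Map([
  ["l o", 0],
  ["lo w", 1],
  ["e r", 2],
]);

console.log(bpe(["l", "o", "w", "e", "r"], ranks)); // -> ["low", "er"]
```

Loading a pretrained tokenizer then mostly means parsing its published vocabulary and merges files into a table like `ranks` above.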
The only thing needed at first is the ability to load and use a predefined (pretrained) tokenizer, such as the ones for GPT-style models or those published on Hugging Face.
Here is a suitable codebase in TypeScript:
https://github.com/botisan-ai/gpt3-tokenizer
Let's evaluate whether we can use it and whether it can load existing pretrained tokenizers. (A nice vocabulary size is 2^15 = 32k, but we can start smaller for a proof of concept.)