support modern tokenizers for NLP #572

@martinjaggi

Description

If we can support recent standard tokenizers (byte-pair encoding, SentencePiece, etc.) as used by modern transformer models, we'll have much better compatibility with current NLP models. (The tokenizers can later be used for any kind of small or large word2vec/LSTM/transformer model.)

The only thing needed at first is the ability to load and use a predefined (pretrained) tokenizer, such as those for GPT-style models, or ones from Hugging Face.

Here is a suitable codebase in TypeScript:
https://github.com/botisan-ai/gpt3-tokenizer

Let's evaluate whether we can use it and whether it can load existing pretrained tokenizers (a nice vocabulary size is 2^15 = 32k, but we can start smaller for a proof of concept).
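For a proof of concept, the core of byte-pair encoding is small enough to sketch directly: start from characters and repeatedly fuse the adjacent pair with the best (lowest) rank in a learned merge list. The merge rules below are toy examples for illustration, not a real pretrained vocabulary; a real tokenizer would load ~32k merges from a vocabulary file.

```typescript
// Minimal sketch of BPE merging, assuming a toy merge list
// (a pretrained tokenizer would ship thousands of ranked merges).
const merges: [string, string][] = [
  ["l", "o"],
  ["lo", "w"],
  ["e", "r"],
];

// Lower rank = merge was learned earlier = applied first.
const mergeRank = new Map<string, number>(
  merges.map((pair, i) => [pair.join(" "), i])
);

function bpe(word: string): string[] {
  let symbols = word.split("");
  while (symbols.length > 1) {
    // Find the adjacent pair with the lowest merge rank.
    let best = -1;
    let bestRank = Infinity;
    for (let i = 0; i < symbols.length - 1; i++) {
      const rank = mergeRank.get(symbols[i] + " " + symbols[i + 1]);
      if (rank !== undefined && rank < bestRank) {
        bestRank = rank;
        best = i;
      }
    }
    if (best < 0) break; // no applicable merge left
    symbols = [
      ...symbols.slice(0, best),
      symbols[best] + symbols[best + 1],
      ...symbols.slice(best + 2),
    ];
  }
  return symbols;
}

console.log(bpe("lower")); // -> ["low", "er"]
```

With the toy merges above, "lower" collapses to the subwords "low" + "er"; swapping in a pretrained merge table (e.g. the GPT-2/GPT-3 vocabulary files that the linked library bundles) would yield the token splits used by those models.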
