support modern tokenizers for NLP #572

@martinjaggi

Description

If we can support recent standard tokenizers (byte-pair encoding, SentencePiece, etc.) as used by modern transformer models, we'll have much better compatibility with current NLP models. (The tokenizers can later be used for any kind of small or large word2vec/LSTM/transformer model.)

The only thing needed at first is the ability to load and use a predefined (pretrained) tokenizer, such as those for GPT-style models, or ones from Hugging Face.

Here is a suitable codebase in TypeScript:
https://github.com/botisan-ai/gpt3-tokenizer

Let's evaluate whether we can use it and whether it can load existing pretrained tokenizers (a nice vocabulary size is 2^15 = 32k, but we can start smaller for a proof of concept).
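For a proof of concept, the core of byte-pair encoding is small enough to sketch directly: start from characters and repeatedly fuse the adjacent pair with the best (lowest) rank in a learned merge list. The merge rules below are toy examples for illustration, not a real pretrained vocabulary; a real tokenizer would load ~32k merges from a vocabulary file.

```typescript
// Minimal sketch of BPE merging, assuming a toy merge list
// (a pretrained tokenizer would ship thousands of ranked merges).
const merges: [string, string][] = [
  ["l", "o"],
  ["lo", "w"],
  ["e", "r"],
];

// Lower rank = merge was learned earlier = applied first.
const mergeRank = new Map<string, number>(
  merges.map((pair, i) => [pair.join(" "), i])
);

function bpe(word: string): string[] {
  let symbols = word.split("");
  while (symbols.length > 1) {
    // Find the adjacent pair with the lowest merge rank.
    let best = -1;
    let bestRank = Infinity;
    for (let i = 0; i < symbols.length - 1; i++) {
      const rank = mergeRank.get(symbols[i] + " " + symbols[i + 1]);
      if (rank !== undefined && rank < bestRank) {
        bestRank = rank;
        best = i;
      }
    }
    if (best < 0) break; // no applicable merge left
    symbols = [
      ...symbols.slice(0, best),
      symbols[best] + symbols[best + 1],
      ...symbols.slice(best + 2),
    ];
  }
  return symbols;
}

console.log(bpe("lower")); // -> ["low", "er"]
```

With the toy merges above, "lower" collapses to the subwords "low" + "er"; swapping in a pretrained merge table (e.g. the GPT-2/GPT-3 vocabulary files that the linked library bundles) would yield the token splits used by those models.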
