Previous and ongoing work on LLM integration in DISCO relies on pre-tokenized datasets and doesn't account for token decoding after inference.
Full tokenizer support would allow:
- To input natural text, which is then tokenized within DISCO
- To perform token decoding after model inference
- To use a pre-trained LLM in DISCO: this first requires converting the weights to a format compatible with TF.js or other JS libraries, but additionally we will need the model's pre-trained tokenizer, which must also be converted to JavaScript
- To train a tokenizer with a particular algorithm (e.g. SentencePiece BPE) on a custom dataset, and use it when training a model
- To save and store the tokenizer alongside the model with which it was used
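The encode/decode roundtrip the list above asks for can be sketched as follows. This is a toy illustration only, not DISCO's actual API: the `ToyTokenizer` class, its whitespace-based `encode`, and the tiny vocabulary are all hypothetical stand-ins for a real trained tokenizer (e.g. SentencePiece BPE) that would be loaded and saved alongside the model.

```javascript
// Toy tokenizer illustrating the encode/decode roundtrip described above.
// The vocabulary and the whitespace split are hypothetical placeholders;
// a real tokenizer would apply trained BPE/SentencePiece merge rules.
class ToyTokenizer {
  constructor(vocab) {
    this.tokenToId = new Map(vocab.map((tok, id) => [tok, id]));
    this.idToToken = new Map(vocab.map((tok, id) => [id, tok]));
    this.unkId = vocab.indexOf("<unk>");
  }

  // Natural text -> token ids (what "input natural text" requires)
  encode(text) {
    return text
      .trim()
      .split(/\s+/)
      .map((tok) => this.tokenToId.get(tok) ?? this.unkId);
  }

  // Token ids -> natural text (what "decoding after inference" requires)
  decode(ids) {
    return ids.map((id) => this.idToToken.get(id) ?? "<unk>").join(" ");
  }
}

const tokenizer = new ToyTokenizer(["<unk>", "hello", "disco", "world"]);
const ids = tokenizer.encode("hello disco");
console.log(ids);                   // [1, 2]
console.log(tokenizer.decode(ids)); // "hello disco"
```

Saving such a tokenizer alongside the model would then amount to serializing its vocabulary (and, for a real tokenizer, its merge rules) next to the converted weights.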