Add tokenization support to Disco LLMs #646

@JulienVig

Description

Previous and current work on LLM integration in DISCO relies on pre-tokenized datasets and does not account for token decoding after inference.

Full tokenizer support would allow:

  • Inputting natural text, which is then tokenized within DISCO
  • Decoding tokens after model inference
  • Using a pre-trained LLM in DISCO: this first requires converting the weights to a format compatible with TF.js or other JS libraries, but the model's pre-trained tokenizer must also be converted to JavaScript
  • Training a tokenizer with a particular algorithm (e.g. SentencePiece BPE) on a custom dataset, using it when training a model, and saving it along with the model
  • Saving and storing the tokenizer alongside the model with which it was used
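The encode/decode/train/save cycle above can be sketched end to end. The following is a minimal, hypothetical TypeScript sketch: `ToyTokenizer` is an illustrative name, not part of DISCO's API, and it uses naive whitespace splitting in place of a real subword algorithm such as SentencePiece BPE. It shows natural-text encoding, token decoding after inference, and JSON serialization so the tokenizer can be stored alongside its model.

```typescript
// Hypothetical toy tokenizer illustrating the requested features.
// A production implementation would train a subword vocabulary instead.
class ToyTokenizer {
  private tokenToId = new Map<string, number>();
  private idToToken = new Map<number, string>();

  // "Train" by collecting a vocabulary from a corpus (whitespace split).
  train(corpus: string[]): void {
    for (const text of corpus) {
      for (const token of text.split(/\s+/)) {
        if (token && !this.tokenToId.has(token)) {
          const id = this.tokenToId.size;
          this.tokenToId.set(token, id);
          this.idToToken.set(id, token);
        }
      }
    }
  }

  // Natural text in, token ids out (unknown tokens map to -1 here).
  encode(text: string): number[] {
    return text
      .split(/\s+/)
      .filter(Boolean)
      .map((t) => this.tokenToId.get(t) ?? -1);
  }

  // Token ids back to text, as needed after model inference.
  decode(ids: number[]): string {
    return ids.map((id) => this.idToToken.get(id) ?? "<unk>").join(" ");
  }

  // Serialize so the tokenizer can be stored alongside the model.
  toJSON(): string {
    return JSON.stringify([...this.tokenToId.entries()]);
  }

  static fromJSON(json: string): ToyTokenizer {
    const t = new ToyTokenizer();
    for (const [token, id] of JSON.parse(json) as [string, number][]) {
      t.tokenToId.set(token, id);
      t.idToToken.set(id, token);
    }
    return t;
  }
}
```

In practice, loading a model's existing pre-trained tokenizer in JavaScript (rather than training one) would replace `train` with a conversion step from the original tokenizer files, which is the harder part the issue points at.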

Metadata

Labels

discojs (Related to Disco.js)
