Add tokenization support to Disco LLMs #646

@JulienVig

Description

Previous and current work on LLM integration in DISCO relies on pre-tokenized datasets and does not account for token decoding after inference.

Full tokenizer support would allow:

  • Inputting natural text, which is then tokenized within DISCO
  • Decoding tokens after model inference
  • Using a pre-trained LLM in DISCO: this first requires converting the weights to a format compatible with TF.js or other JS libraries, but the model's pre-trained tokenizer must also be converted to JavaScript
  • Training a tokenizer with a particular algorithm (e.g. SentencePiece BPE) on a custom dataset, using it when training a model, and saving it along with the model
  • Saving and storing the tokenizer alongside the model with which it was used
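The encode/decode/train/save cycle above can be sketched end to end. The following is a minimal, hypothetical TypeScript sketch: `ToyTokenizer` is an illustrative name, not part of DISCO's API, and it uses naive whitespace splitting in place of a real subword algorithm such as SentencePiece BPE. It shows natural-text encoding, token decoding after inference, and JSON serialization so the tokenizer can be stored alongside its model.

```typescript
// Hypothetical toy tokenizer illustrating the requested features.
// A production implementation would train a subword vocabulary instead.
class ToyTokenizer {
  private tokenToId = new Map<string, number>();
  private idToToken = new Map<number, string>();

  // "Train" by collecting a vocabulary from a corpus (whitespace split).
  train(corpus: string[]): void {
    for (const text of corpus) {
      for (const token of text.split(/\s+/)) {
        if (token && !this.tokenToId.has(token)) {
          const id = this.tokenToId.size;
          this.tokenToId.set(token, id);
          this.idToToken.set(id, token);
        }
      }
    }
  }

  // Natural text in, token ids out (unknown tokens map to -1 here).
  encode(text: string): number[] {
    return text
      .split(/\s+/)
      .filter(Boolean)
      .map((t) => this.tokenToId.get(t) ?? -1);
  }

  // Token ids back to text, as needed after model inference.
  decode(ids: number[]): string {
    return ids.map((id) => this.idToToken.get(id) ?? "<unk>").join(" ");
  }

  // Serialize so the tokenizer can be stored alongside the model.
  toJSON(): string {
    return JSON.stringify([...this.tokenToId.entries()]);
  }

  static fromJSON(json: string): ToyTokenizer {
    const t = new ToyTokenizer();
    for (const [token, id] of JSON.parse(json) as [string, number][]) {
      t.tokenToId.set(token, id);
      t.idToToken.set(id, token);
    }
    return t;
  }
}
```

In practice, loading a model's existing pre-trained tokenizer in JavaScript (rather than training one) would replace `train` with a conversion step from the original tokenizer files, which is the harder part the issue points at.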

Metadata

Labels

discojs (Related to Disco.js)
