
Add support for tokenizers tokenizers #1251

Closed
@gabe-l-hart

Description


🚀 The feature, motivation and pitch

The request is to extend the tokenizer module in torchchat to support tokenizers that use the Hugging Face tokenizers library.

Many models use tokenizers-based tokenizers and will not be able to run in torchchat until those tokenizers can be loaded and run, either via the tokenizers library directly or via a conversion to tiktoken or sentencepiece.
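For reference, the tokenizers library can build and run a tokenizer entirely in memory. A minimal sketch of the kind of BPE round-trip torchchat would need to support (the toy vocab and merges below are illustrative, not from a real model):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace

# Toy BPE: vocab maps token string -> id, merges are applied in order.
vocab = {"h": 0, "i": 1, "hi": 2}
merges = [("h", "i")]

tok = Tokenizer(BPE(vocab, merges))
tok.pre_tokenizer = Whitespace()

# "hi" pre-tokenizes to one word, then h+i merges into token 2.
ids = tok.encode("hi hi").ids  # → [2, 2]
```

Real models would instead load their full configuration (model, normalizer, pre-tokenizer, decoder) with `Tokenizer.from_file("tokenizer.json")`.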

Alternatives

It may be possible to convert a tokenizers tokenizer to a tiktoken tokenizer. I have a working implementation of this for the llama tokenizer.json model; however, other models that use different tokenizers configurations do not work (in particular Granite Code).
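One concrete step such a conversion needs is inverting the GPT-2 byte-level unicode mapping, so that tokenizer.json's string-keyed vocab becomes the bytes-to-rank table tiktoken expects for `mergeable_ranks`. A sketch (the function names are mine, and this assumes ByteLevel pre-tokenization, which is exactly where other configurations can diverge; the regex `pat_str` and special tokens would still need separate translation):

```python
def bytes_to_unicode():
    """GPT-2's byte-level map: every byte gets a printable unicode char."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)  # map non-printable bytes above U+0100
            n += 1
    return dict(zip(bs, map(chr, cs)))


def vocab_to_mergeable_ranks(vocab):
    """Convert a tokenizer.json vocab (str -> rank) to tiktoken's
    mergeable_ranks (bytes -> rank) by undoing the byte-level map."""
    char_to_byte = {c: b for b, c in bytes_to_unicode().items()}
    return {
        bytes(char_to_byte[c] for c in token): rank
        for token, rank in vocab.items()
    }


# "Ġhello" is the byte-level encoding of " hello" (Ġ stands for a space byte).
ranks = vocab_to_mergeable_ranks({"Ġhello": 0, "hi": 1})
```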

Additional context

This issue is a piece of the puzzle for adding support for Granite Code 3b/8b, which use the llama architecture in transformers but take advantage of several pieces of the architecture that are not currently supported by torchchat. The work-in-progress for Granite Code can be found on my fork: https://github.com/gabe-l-hart/torchchat/tree/GraniteCodeSupport.

I have a partially fleshed-out working version of this that I plan to put up as a draft PR for discussion. I am not intimately familiar with the algorithmic differences between tiktoken and the various tokenizers pieces (in particular the pre-tokenizers). My branch has a Python implementation that simply wraps tokenizers, but I have not yet tried to export Granite Code to other formats, where I suspect it would break without a corresponding C++ implementation. I plan to investigate this further soon!
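As a sketch of what such a Python wrapper could look like (the `encode`/`decode` signature and the `bos`/`eos` flags here are assumptions modeled on a typical torchchat-style tokenizer interface, not the actual torchchat API):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace


class TokenizersWrapper:
    """Hypothetical adapter that exposes a tokenizers.Tokenizer through
    encode/decode methods (interface names are assumptions)."""

    def __init__(self, tokenizer, bos_id=None, eos_id=None):
        self._t = tokenizer
        self._bos = bos_id
        self._eos = eos_id

    def encode(self, text, bos=False, eos=False):
        ids = self._t.encode(text).ids
        if bos and self._bos is not None:
            ids = [self._bos] + ids
        if eos and self._eos is not None:
            ids = ids + [self._eos]
        return ids

    def decode(self, ids):
        return self._t.decode(ids)


# Demo with a tiny in-memory BPE model (toy data, not a real tokenizer.json).
vocab = {"<s>": 0, "l": 1, "o": 2, "w": 3, "lo": 4, "low": 5}
merges = [("l", "o"), ("lo", "w")]
t = Tokenizer(BPE(vocab, merges))
t.pre_tokenizer = Whitespace()

wrapped = TokenizersWrapper(t, bos_id=0)
ids = wrapped.encode("low low", bos=True)  # → [0, 5, 5]
```

A wrapper like this covers eager Python execution, but, as noted above, exported/compiled paths would still need a native counterpart.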

RFC (Optional)

No response

Metadata

Assignees

No one assigned

Labels

No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests