
Add support for tokenizers tokenizers #1251

Closed
@gabe-l-hart

Description


🚀 The feature, motivation and pitch

The request is to extend the tokenizer module in torchchat to support tokenizers that use the Hugging Face tokenizers library.

Many models use tokenizers-based tokenizers and will not be able to run in torchchat until those tokenizers can be loaded and run, either via the tokenizers library directly or via a conversion to tiktoken or sentencepiece.
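For reference, the tokenizers library can build and run a tokenizer entirely in memory. A minimal sketch of the kind of BPE round-trip torchchat would need to support (the toy vocab and merges below are illustrative, not from a real model):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace

# Toy BPE: vocab maps token string -> id, merges are applied in order.
vocab = {"h": 0, "i": 1, "hi": 2}
merges = [("h", "i")]

tok = Tokenizer(BPE(vocab, merges))
tok.pre_tokenizer = Whitespace()

# "hi" pre-tokenizes to one word, then h+i merges into token 2.
ids = tok.encode("hi hi").ids  # → [2, 2]
```

Real models would instead load their full configuration (model, normalizer, pre-tokenizer, decoder) with `Tokenizer.from_file("tokenizer.json")`.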

Alternatives

It may be possible to convert a tokenizers tokenizer to a tiktoken tokenizer. I have a working implementation of this for the llama tokenizer.json model; however, other models that use different tokenizers configurations do not work (in particular Granite Code).
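One concrete step such a conversion needs is inverting the GPT-2 byte-level unicode mapping, so that tokenizer.json's string-keyed vocab becomes the bytes-to-rank table tiktoken expects for `mergeable_ranks`. A sketch (the function names are mine, and this assumes ByteLevel pre-tokenization, which is exactly where other configurations can diverge; the regex `pat_str` and special tokens would still need separate translation):

```python
def bytes_to_unicode():
    """GPT-2's byte-level map: every byte gets a printable unicode char."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)  # map non-printable bytes above U+0100
            n += 1
    return dict(zip(bs, map(chr, cs)))


def vocab_to_mergeable_ranks(vocab):
    """Convert a tokenizer.json vocab (str -> rank) to tiktoken's
    mergeable_ranks (bytes -> rank) by undoing the byte-level map."""
    char_to_byte = {c: b for b, c in bytes_to_unicode().items()}
    return {
        bytes(char_to_byte[c] for c in token): rank
        for token, rank in vocab.items()
    }


# "Ġhello" is the byte-level encoding of " hello" (Ġ stands for a space byte).
ranks = vocab_to_mergeable_ranks({"Ġhello": 0, "hi": 1})
```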

Additional context

This issue is a piece of the puzzle for adding support for Granite Code 3b/8b, which use the llama architecture in transformers but take advantage of several pieces of the architecture that are not currently supported by torchchat. The work-in-progress for Granite Code can be found on my fork: https://github.com/gabe-l-hart/torchchat/tree/GraniteCodeSupport.

I have a partially fleshed-out working version of this that I plan to put up as a draft PR for discussion. I am not intimately familiar with the algorithmic differences between tiktoken and the various tokenizers pieces (in particular the pre-tokenizers). My branch has a Python implementation that simply wraps tokenizers, but I have not yet tried to export Granite Code to other formats, where I suspect it would break without a corresponding C++ implementation. I plan to investigate this further soon!
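As a sketch of what such a Python wrapper could look like (the `encode`/`decode` signature and the `bos`/`eos` flags here are assumptions modeled on a typical torchchat-style tokenizer interface, not the actual torchchat API):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace


class TokenizersWrapper:
    """Hypothetical adapter that exposes a tokenizers.Tokenizer through
    encode/decode methods (interface names are assumptions)."""

    def __init__(self, tokenizer, bos_id=None, eos_id=None):
        self._t = tokenizer
        self._bos = bos_id
        self._eos = eos_id

    def encode(self, text, bos=False, eos=False):
        ids = self._t.encode(text).ids
        if bos and self._bos is not None:
            ids = [self._bos] + ids
        if eos and self._eos is not None:
            ids = ids + [self._eos]
        return ids

    def decode(self, ids):
        return self._t.decode(ids)


# Demo with a tiny in-memory BPE model (toy data, not a real tokenizer.json).
vocab = {"<s>": 0, "l": 1, "o": 2, "w": 3, "lo": 4, "low": 5}
merges = [("l", "o"), ("lo", "w")]
t = Tokenizer(BPE(vocab, merges))
t.pre_tokenizer = Whitespace()

wrapped = TokenizersWrapper(t, bos_id=0)
ids = wrapped.encode("low low", bos=True)  # → [0, 5, 5]
```

A wrapper like this covers eager Python execution, but, as noted above, exported/compiled paths would still need a native counterpart.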

RFC (Optional)

No response

Metadata

Assignees

No one assigned

Labels

No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests