Whisper Tokenizer support #7353

Open
@MithrilMan

Description

Is your feature request related to a problem? Please describe.
Whisper tokenizer support needed

Describe the solution you'd like
Would be nice to have support for the Whisper tokenizer.

Describe alternatives you've considered
I'm new to tokenizers, so I'm not sure whether what I'm doing right now is correct. I'm trying to use a BpeTokenizer, passing it the vocab and merges files plus the special tokens. This is not straightforward: for example, I'm reading the special tokens from https://huggingface.co/onnx-community/whisper-large-v3-turbo/blob/main/special_tokens_map.json, and I need to read the vocab file too to get the max id, so that I know which id to start from when mapping special tokens to id numbers.
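To make the workaround concrete, here is a minimal Python sketch of the id assignment described above: find the highest id in the vocab and number the special tokens from the next id onward. The function name `assign_special_token_ids` and the sample tokens are hypothetical, and the sketch assumes the special tokens are listed in the same order the model expects, which is not guaranteed.

```python
def assign_special_token_ids(vocab, special_tokens):
    """Map special tokens to ids that continue after the highest vocab id.

    vocab: dict of token string -> integer id (as loaded from a vocab file).
    special_tokens: list of special token strings, in the order to assign.
    Assumption: tokens already present in vocab keep their existing id
    and are skipped here.
    """
    next_id = max(vocab.values()) + 1
    mapping = {}
    for token in special_tokens:
        if token in vocab or token in mapping:
            continue  # already has an id, do not assign a second one
        mapping[token] = next_id
        next_id += 1
    return mapping


# Illustrative data only, not the real Whisper vocab.
example_vocab = {"a": 0, "b": 1, "<|endoftext|>": 2}
example_special = ["<|endoftext|>", "<|startoftranscript|>", "<|en|>"]
example_mapping = assign_special_token_ids(example_vocab, example_special)
```

Because the result depends on the order of the special-token list, the ids produced this way should be checked against the ids the model was trained with before relying on them.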

The linked repository also has a tokenizer.json that I suppose already contains everything, without the need to pass vocab and merges, but I don't see a way to use it out of the box (I haven't found a constructor that accepts a tokenizer.json file).
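As an illustration of what the tokenizer.json might provide, the following Python sketch pulls the vocab, merges, and special tokens out of an already parsed file. It assumes the usual Hugging Face tokenizers layout (a `model` object with `vocab` and `merges`, and a top-level `added_tokens` list whose entries carry `id`, `content`, and `special`); the function name `split_tokenizer_json` is hypothetical, and the layout should be verified against the actual file.

```python
import json


def split_tokenizer_json(data):
    """Extract (vocab, merges, special_tokens) from parsed tokenizer.json data.

    Assumed layout: data["model"]["vocab"] is a token -> id dict,
    data["model"]["merges"] is a list of merges, and data["added_tokens"]
    is a list of {"id": ..., "content": ..., "special": ...} entries.
    """
    model = data.get("model", {})
    vocab = dict(model.get("vocab", {}))

    # Merges may be stored as "a b" strings or as ["a", "b"] pairs;
    # normalize both to the "a b" form used by merges.txt files.
    merges = []
    for merge in model.get("merges", []):
        merges.append(merge if isinstance(merge, str) else " ".join(merge))

    # Special tokens carry explicit ids, so no max-id arithmetic is needed.
    special_tokens = {
        entry["content"]: entry["id"]
        for entry in data.get("added_tokens", [])
        if entry.get("special")
    }
    return vocab, merges, special_tokens


# Illustrative data only; a real file would be loaded with
# json.load(open("tokenizer.json", encoding="utf-8")).
sample = json.loads("""
{
  "added_tokens": [
    {"id": 3, "content": "<|endoftext|>", "special": true},
    {"id": 4, "content": "extra", "special": false}
  ],
  "model": {
    "vocab": {"a": 0, "b": 1, "ab": 2},
    "merges": ["a b", ["ab", "a"]]
  }
}
""")
sample_vocab, sample_merges, sample_special = split_tokenizer_json(sample)
```

If the file really has this shape, the extracted pieces could be fed to a BPE tokenizer the same way as separate vocab and merges files, with the special-token ids taken directly from the file.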
