
[WIP] Add MMS and Wav2Vec2 models (Closes #209) #220


Merged — 38 commits, Aug 14, 2023

Commits
87fc35b
Add example `wav2vec2` models
xenova Jul 26, 2023
93f5963
Add support for `CTCDecoder` and `Wav2Vec2CTCTokenizer`
xenova Jul 27, 2023
2fd7b07
Generate tokenizer.json files for wav2vec2 models
xenova Jul 27, 2023
c749376
Fix wav2vec2 custom tokenizer generation
xenova Jul 27, 2023
2519b16
Implement wav2vec2 audio-speech-recognition
xenova Jul 27, 2023
77aa309
Add `Wav2Vec2` as a supported architecture
xenova Jul 27, 2023
e90ba2f
Update README.md
xenova Jul 27, 2023
57fd46a
Update generate_tests.py
xenova Jul 27, 2023
f93bb82
Ignore invalid tests
xenova Jul 27, 2023
4a686ad
Update supported wav2vec2 models
xenova Jul 27, 2023
3723122
Update supported_models.py
xenova Jul 27, 2023
33aa220
Simplify pipeline construction
xenova Jul 27, 2023
ac817a2
Implement basic audio classification pipeline
xenova Jul 27, 2023
52a7f73
Update default topk value for audio classification pipeline
xenova Jul 28, 2023
793bed2
Add example usage for the audio classification pipeline
xenova Jul 28, 2023
e4d10d3
Move `loadAudio` to utils file
xenova Jul 28, 2023
874cac2
Add audio classification unit test
xenova Jul 28, 2023
593a80d
Add wav2vec2 ASR unit test
xenova Jul 28, 2023
dd80669
Improve generated wav2vec2 tokenizer json
xenova Jul 29, 2023
b203ccf
Update supported_models.py
xenova Jul 29, 2023
54fa19c
Allow `added_tokens_regex` to be null
xenova Jul 29, 2023
0681846
Support exporting mms vocabs
xenova Jul 29, 2023
7a728ef
Supported nested vocabularies
xenova Jul 29, 2023
578cf49
Merge branch 'main' into mms
xenova Aug 1, 2023
909eca2
Update supported tasks and models
xenova Aug 1, 2023
b1a4eea
Add warnings to ignore language and task for wav2vec2 models
xenova Aug 1, 2023
d56eb76
Mark internal methods as private
xenova Aug 1, 2023
333ea9e
Add typing to audio variable
xenova Aug 1, 2023
bbc3106
Update node-audio-processing.mdx
xenova Aug 1, 2023
70f5baf
Move node-audio-processing to guides
xenova Aug 1, 2023
218efe8
Update table of contents
xenova Aug 1, 2023
5982307
Merge branch 'main' into mms
xenova Aug 13, 2023
1e31d50
Add example code for performing feature extraction w/ `Wav2Vec2Model`
xenova Aug 13, 2023
870147b
Refactor `Pipeline` class params
xenova Aug 13, 2023
c29bdea
Fix `pipeline` function
xenova Aug 13, 2023
3fc98dd
Fix typo in `pipeline` JSDoc
xenova Aug 14, 2023
4969c80
Fix second typo
xenova Aug 14, 2023
9f2ec15
Merge branch 'main' into mms
xenova Aug 14, 2023
4 changes: 3 additions & 1 deletion README.md
@@ -217,7 +217,7 @@ You can refine your search by selecting the task you're interested in (e.g., [te

| Task | ID | Description | Supported? |
|--------------------------|----|-------------|------------|
| [Audio Classification](https://huggingface.co/tasks/audio-classification) | `audio-classification` | Assigning a label or class to a given audio. | ❌ |
| [Audio Classification](https://huggingface.co/tasks/audio-classification) | `audio-classification` | Assigning a label or class to a given audio. | ✅ |
| [Audio-to-Audio](https://huggingface.co/tasks/audio-to-audio) | n/a | Generating audio from an input audio source. | ❌ |
| [Automatic Speech Recognition](https://huggingface.co/tasks/automatic-speech-recognition) | `automatic-speech-recognition` | Transcribing a given audio into text. | ✅ |
| [Text-to-Speech](https://huggingface.co/tasks/text-to-speech) | n/a | Generating natural-sounding speech given text input. | ❌ |
@@ -268,6 +268,7 @@ You can refine your search by selecting the task you're interested in (e.g., [te
1. **[GPTBigCode](https://huggingface.co/docs/transformers/model_doc/gpt_bigcode)** (from BigCode) released with the paper [SantaCoder: don't reach for the stars!](https://arxiv.org/abs/2301.03988) by Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, Logesh Kumar Umapathi, Carolyn Jane Anderson, Yangtian Zi, Joel Lamy Poirier, Hailey Schoelkopf, Sergey Troshin, Dmitry Abulkhanov, Manuel Romero, Michael Lappert, Francesco De Toni, Bernardo García del Río, Qian Liu, Shamik Bose, Urvashi Bhattacharyya, Terry Yue Zhuo, Ian Yu, Paulo Villegas, Marco Zocca, Sourab Mangrulkar, David Lansky, Huu Nguyen, Danish Contractor, Luis Villa, Jia Li, Dzmitry Bahdanau, Yacine Jernite, Sean Hughes, Daniel Fried, Arjun Guha, Harm de Vries, Leandro von Werra.
1. **[M2M100](https://huggingface.co/docs/transformers/model_doc/m2m_100)** (from Facebook) released with the paper [Beyond English-Centric Multilingual Machine Translation](https://arxiv.org/abs/2010.11125) by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin.
1. **[MarianMT](https://huggingface.co/docs/transformers/model_doc/marian)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team.
1. **[MMS](https://huggingface.co/docs/transformers/model_doc/mms)** (from Facebook) released with the paper [Scaling Speech Technology to 1,000+ Languages](https://arxiv.org/abs/2305.13516) by Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli.
1. **[MobileBERT](https://huggingface.co/docs/transformers/model_doc/mobilebert)** (from CMU/Google Brain) released with the paper [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://arxiv.org/abs/2004.02984) by Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou.
1. **[MobileViT](https://huggingface.co/docs/transformers/model_doc/mobilevit)** (from Apple) released with the paper [MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer](https://arxiv.org/abs/2110.02178) by Sachin Mehta and Mohammad Rastegari.
1. **[MPNet](https://huggingface.co/docs/transformers/model_doc/mpnet)** (from Microsoft Research) released with the paper [MPNet: Masked and Permuted Pre-training for Language Understanding](https://arxiv.org/abs/2004.09297) by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu.
@@ -278,6 +279,7 @@ You can refine your search by selecting the task you're interested in (e.g., [te
1. **[T5](https://huggingface.co/docs/transformers/model_doc/t5)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
1. **[T5v1.1](https://huggingface.co/docs/transformers/model_doc/t5v1.1)** (from Google AI) released in the repository [google-research/text-to-text-transfer-transformer](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
1. **[Whisper](https://huggingface.co/docs/transformers/model_doc/whisper)** (from OpenAI) released with the paper [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf) by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever.
1. **[XLM-RoBERTa](https://huggingface.co/docs/transformers/model_doc/xlm-roberta)** (from Facebook AI), released together with the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Alexis Conneau*, Kartikay Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov.

2 changes: 1 addition & 1 deletion docs/snippets/5_supported-tasks.snippet
@@ -35,7 +35,7 @@

| Task | ID | Description | Supported? |
|--------------------------|----|-------------|------------|
| [Audio Classification](https://huggingface.co/tasks/audio-classification) | `audio-classification` | Assigning a label or class to a given audio. | ❌ |
| [Audio Classification](https://huggingface.co/tasks/audio-classification) | `audio-classification` | Assigning a label or class to a given audio. | ✅ |
| [Audio-to-Audio](https://huggingface.co/tasks/audio-to-audio) | n/a | Generating audio from an input audio source. | ❌ |
| [Automatic Speech Recognition](https://huggingface.co/tasks/automatic-speech-recognition) | `automatic-speech-recognition` | Transcribing a given audio into text. | ✅ |
| [Text-to-Speech](https://huggingface.co/tasks/text-to-speech) | n/a | Generating natural-sounding speech given text input. | ❌ |
2 changes: 2 additions & 0 deletions docs/snippets/6_supported-models.snippet
@@ -16,6 +16,7 @@
1. **[GPTBigCode](https://huggingface.co/docs/transformers/model_doc/gpt_bigcode)** (from BigCode) released with the paper [SantaCoder: don't reach for the stars!](https://arxiv.org/abs/2301.03988) by Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, Logesh Kumar Umapathi, Carolyn Jane Anderson, Yangtian Zi, Joel Lamy Poirier, Hailey Schoelkopf, Sergey Troshin, Dmitry Abulkhanov, Manuel Romero, Michael Lappert, Francesco De Toni, Bernardo García del Río, Qian Liu, Shamik Bose, Urvashi Bhattacharyya, Terry Yue Zhuo, Ian Yu, Paulo Villegas, Marco Zocca, Sourab Mangrulkar, David Lansky, Huu Nguyen, Danish Contractor, Luis Villa, Jia Li, Dzmitry Bahdanau, Yacine Jernite, Sean Hughes, Daniel Fried, Arjun Guha, Harm de Vries, Leandro von Werra.
1. **[M2M100](https://huggingface.co/docs/transformers/model_doc/m2m_100)** (from Facebook) released with the paper [Beyond English-Centric Multilingual Machine Translation](https://arxiv.org/abs/2010.11125) by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin.
1. **[MarianMT](https://huggingface.co/docs/transformers/model_doc/marian)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team.
1. **[MMS](https://huggingface.co/docs/transformers/model_doc/mms)** (from Facebook) released with the paper [Scaling Speech Technology to 1,000+ Languages](https://arxiv.org/abs/2305.13516) by Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli.
1. **[MobileBERT](https://huggingface.co/docs/transformers/model_doc/mobilebert)** (from CMU/Google Brain) released with the paper [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://arxiv.org/abs/2004.02984) by Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou.
1. **[MobileViT](https://huggingface.co/docs/transformers/model_doc/mobilevit)** (from Apple) released with the paper [MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer](https://arxiv.org/abs/2110.02178) by Sachin Mehta and Mohammad Rastegari.
1. **[MPNet](https://huggingface.co/docs/transformers/model_doc/mpnet)** (from Microsoft Research) released with the paper [MPNet: Masked and Permuted Pre-training for Language Understanding](https://arxiv.org/abs/2004.09297) by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu.
@@ -26,6 +27,7 @@
1. **[T5](https://huggingface.co/docs/transformers/model_doc/t5)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
1. **[T5v1.1](https://huggingface.co/docs/transformers/model_doc/t5v1.1)** (from Google AI) released in the repository [google-research/text-to-text-transfer-transformer](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
1. **[Whisper](https://huggingface.co/docs/transformers/model_doc/whisper)** (from OpenAI) released with the paper [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf) by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever.
1. **[XLM-RoBERTa](https://huggingface.co/docs/transformers/model_doc/xlm-roberta)** (from Facebook AI), released together with the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Alexis Conneau*, Kartikay Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov.

4 changes: 2 additions & 2 deletions docs/source/_toctree.yml
@@ -19,12 +19,12 @@
title: Building an Electron Application
- local: tutorials/node
title: Server-side Inference in Node.js
- local: tutorials/node-audio-processing
title: Server-side Audio Processing in Node.js
title: Tutorials
- sections:
- local: guides/private
title: Accessing Private/Gated Models
- local: guides/node-audio-processing
title: Server-side Audio Processing in Node.js
title: Developer Guides
- sections:
- local: api/transformers
@@ -73,9 +73,17 @@ wav.toBitDepth('32f'); // Pipeline expects input as a Float32Array
wav.toSampleRate(16000); // Whisper expects audio with a sampling rate of 16000
let audioData = wav.getSamples();
if (Array.isArray(audioData)) {
// For this demo, if there are multiple channels for the audio file, we just select the first one.
// In practice, you'd probably want to convert all channels to a single channel (e.g., stereo -> mono).
audioData = audioData[0];
if (audioData.length > 1) {
const SCALING_FACTOR = Math.sqrt(2);

// Merge channels (into first channel to save memory)
for (let i = 0; i < audioData[0].length; ++i) {
audioData[0][i] = SCALING_FACTOR * (audioData[0][i] + audioData[1][i]) / 2;
}
}

// Select first channel
audioData = audioData[0];
}
```
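The channel-merging step above is easy to exercise in isolation. Below is a self-contained sketch of the same stereo-to-mono downmix (the helper name `mergeChannels` is ours, for illustration); the `Math.sqrt(2)` scaling compensates for the loudness lost when the two channels are averaged:

```javascript
// Stereo-to-mono downmix, as in the diff above: average the two
// channels, then scale by sqrt(2) to preserve perceived loudness.
function mergeChannels(left, right) {
  const SCALING_FACTOR = Math.sqrt(2);
  const mono = new Float32Array(left.length);
  for (let i = 0; i < left.length; ++i) {
    mono[i] = SCALING_FACTOR * (left[i] + right[i]) / 2;
  }
  return mono;
}

const mono = mergeChannels(
  new Float32Array([1, 1]),
  new Float32Array([1, 3]),
);
console.log(mono); // ≈ Float32Array [1.414, 2.828]
```

Unlike the in-place merge in the diff (which reuses the first channel's buffer to save memory), this sketch allocates a fresh array for clarity.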

12 changes: 10 additions & 2 deletions examples/node-audio-processing/index.js
@@ -14,8 +14,16 @@ wav.toBitDepth('32f'); // Pipeline expects input as a Float32Array
wav.toSampleRate(16000); // Whisper expects audio with a sampling rate of 16000
let audioData = wav.getSamples();
if (Array.isArray(audioData)) {
// For this demo, if there are multiple channels for the audio file, we just select the first one.
// In practice, you'd probably want to convert all channels to a single channel (e.g., stereo -> mono).
if (audioData.length > 1) {
const SCALING_FACTOR = Math.sqrt(2);

// Merge channels (into first channel to save memory)
for (let i = 0; i < audioData[0].length; ++i) {
audioData[0][i] = SCALING_FACTOR * (audioData[0][i] + audioData[1][i]) / 2;
}
}

// Select first channel
audioData = audioData[0];
}

21 changes: 19 additions & 2 deletions scripts/convert.py
@@ -32,6 +32,10 @@
}
}

MODELS_WITHOUT_TOKENIZERS = [
'wav2vec2'
]


@dataclass
class ConversionArguments:
@@ -212,12 +216,16 @@ def main():

tokenizer = None
try:
# Save tokenizer
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

except KeyError:
pass # No Tokenizer

except Exception as e:
if config.model_type not in MODELS_WITHOUT_TOKENIZERS:
raise e

export_kwargs = dict(
model_name_or_path=model_id,
output=output_model_folder,
@@ -233,7 +241,7 @@
tokenizer_json = generate_tokenizer_json(model_id, tokenizer)

with open(os.path.join(output_model_folder, 'tokenizer.json'), 'w', encoding='utf-8') as fp:
json.dump(tokenizer_json, fp)
json.dump(tokenizer_json, fp, indent=4)

elif config.model_type == 'whisper':
if conv_args.output_attentions:
@@ -242,6 +250,15 @@
export_kwargs.update(
**get_main_export_kwargs(config, "automatic-speech-recognition")
)

elif config.model_type == 'wav2vec2':
if tokenizer is not None:
from .extra.wav2vec2 import generate_tokenizer_json
tokenizer_json = generate_tokenizer_json(tokenizer)

with open(os.path.join(output_model_folder, 'tokenizer.json'), 'w', encoding='utf-8') as fp:
json.dump(tokenizer_json, fp, indent=4)

else:
pass # TODO
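For wav2vec2 models, `convert.py` now delegates tokenizer conversion to the `generate_tokenizer_json` helper in `scripts/extra/wav2vec2.py`. The MMS-specific part — nested per-language vocabularies, with special tokens read from the English sub-vocab — can be sketched as follows (in JavaScript rather than Python, and with a made-up sample vocab, not one from a real checkpoint):

```javascript
// MMS checkpoints nest one vocabulary per language id, so the
// angle-bracketed special tokens are read from the 'eng' sub-vocab.
const vocab = {
  eng: { '<pad>': 0, '<s>': 1, '</s>': 2, '<unk>': 3, a: 4, '|': 5 },
  fra: { '<pad>': 0, '<s>': 1, '</s>': 2, '<unk>': 3, à: 4, '|': 5 },
};

// Flat (non-MMS) vocabs contain '<pad>' at the top level; nested ones do not.
const specialTokensVocab = '<pad>' in vocab ? vocab : vocab['eng'];

const addedTokens = Object.entries(specialTokensVocab)
  .filter(([token]) => token.startsWith('<') && token.endsWith('>'))
  .map(([content, id]) => ({
    id, content,
    single_word: false, lstrip: false, rstrip: false,
    normalized: false, special: true,
  }));

console.log(addedTokens.map(t => t.content)); // ['<pad>', '<s>', '</s>', '<unk>']
```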

58 changes: 58 additions & 0 deletions scripts/extra/wav2vec2.py
@@ -0,0 +1,58 @@

def generate_tokenizer_json(tokenizer):
vocab = tokenizer.vocab

special_tokens_vocab = vocab
if "<pad>" not in tokenizer.vocab:
# For MMS tokenizers, the vocab is of the form:
# {
# language_id: { language_vocab }
# }
# So, to get the list of special tokens, we just get the english vocab
special_tokens_vocab = vocab['eng']

tokenizer_json = {
"version": "1.0",
"truncation": None,
"padding": None,
"added_tokens": [
{
"id": v,
"content": k,
"single_word": False,
"lstrip": False,
"rstrip": False,
"normalized": False,
"special": True
}
for k, v in special_tokens_vocab.items()
if k.startswith('<') and k.endswith('>')
],
"normalizer": {
"type": "Replace",
"pattern": {
"String": " "
},
"content": "|"
},
"pre_tokenizer": {
"type": "Split",
"pattern": {
"Regex": ""
},
"behavior": "Isolated",
"invert": False
},
"post_processor": None,
"decoder": {
"type": "CTC",
"pad_token": "<pad>",
"word_delimiter_token": "|",
"cleanup": True
},
"model": {
"vocab": vocab
}
}

return tokenizer_json
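The `decoder` block above configures CTC decoding with `<pad>` as the blank token and `|` as the word delimiter. As a rough sketch of what that decoding entails — collapse repeated tokens, drop blanks, and map the delimiter back to a space (the real logic lives in the library's `CTCDecoder`; this toy helper and its input are purely illustrative):

```javascript
// Greedy CTC decoding: consecutive duplicates collapse to one token,
// the pad (blank) token is dropped, and '|' becomes a space.
function ctcDecode(tokens, padToken = '<pad>', wordDelimiter = '|') {
  const chars = [];
  let prev = null;
  for (const tok of tokens) {
    if (tok !== prev && tok !== padToken) {
      chars.push(tok === wordDelimiter ? ' ' : tok);
    }
    prev = tok; // a pad between repeats lets the next one through
  }
  return chars.join('').trim(); // "cleanup": trim stray whitespace
}

const text = ctcDecode(
  ['H', 'H', 'E', '<pad>', 'L', 'L', '<pad>', 'L', 'O', '|', '<pad>'],
);
console.log(text); // → "HELLO"
```

Note how the pad token doubles as a separator: the two runs of `L` survive as distinct letters because a `<pad>` sits between them.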
22 changes: 22 additions & 0 deletions scripts/supported_models.py
@@ -226,6 +226,28 @@
'facebook/dino-vitb8',
'facebook/dino-vits16',
],
'wav2vec2': [
# feature extraction # NOTE: requires --task feature-extraction
'facebook/mms-300m',
'facebook/mms-1b',

# audio classification
'alefiury/wav2vec2-large-xlsr-53-gender-recognition-librispeech',
'superb/wav2vec2-base-superb-ks',
'facebook/mms-lid-126',
'facebook/mms-lid-256',
'facebook/mms-lid-512',
'facebook/mms-lid-1024',
'facebook/mms-lid-2048',
'facebook/mms-lid-4017',

# speech recognition
'jonatasgrosman/wav2vec2-large-xlsr-53-english',
'facebook/wav2vec2-base-960h',
'facebook/mms-1b-l1107',
'facebook/mms-1b-all',
'facebook/mms-1b-fl102',
],
'whisper': [
'openai/whisper-tiny',
'openai/whisper-tiny.en',