
Commit d479953

[WIP] Add MMS and Wav2Vec2 models (Closes huggingface#209) (huggingface#220)
* Add example `wav2vec2` models
* Add support for `CTCDecoder` and `Wav2Vec2CTCTokenizer`
* Generate tokenizer.json files for wav2vec2 models
* Fix wav2vec2 custom tokenizer generation
* Implement wav2vec2 automatic-speech-recognition
* Add `Wav2Vec2` as a supported architecture
* Update README.md
* Update generate_tests.py
* Ignore invalid tests
* Update supported wav2vec2 models
* Update supported_models.py
* Simplify pipeline construction
* Implement basic audio classification pipeline
* Update default topk value for audio classification pipeline
* Add example usage for the audio classification pipeline
* Move `loadAudio` to utils file
* Add audio classification unit test
* Add wav2vec2 ASR unit test
* Improve generated wav2vec2 tokenizer json
* Update supported_models.py
* Allow `added_tokens_regex` to be null
* Support exporting MMS vocabs
* Support nested vocabularies
* Update supported tasks and models
* Add warnings that `language` and `task` are ignored for wav2vec2 models (will add in future)
* Mark internal methods as private
* Add typing to audio variable
* Update node-audio-processing.mdx
* Move node-audio-processing to guides
* Update table of contents
* Add example code for performing feature extraction w/ `Wav2Vec2Model`. NOTE: feature extraction of MMS models is currently broken in the Python library, but it works correctly here. See huggingface/transformers#25485 for more info.
* Refactor `Pipeline` class params
* Fix `pipeline` function
* Fix typo in `pipeline` JSDoc
* Fix second typo
1 parent 060ac83 commit d479953

18 files changed: +791 −147 lines
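The headline user-facing change is wav2vec2 support in the `pipeline` API: automatic speech recognition via CTC decoding, plus a new audio classification pipeline. A minimal sketch of how these might be invoked — the `Xenova/*` model IDs and URLs are illustrative assumptions, not part of this commit:

```js
import { pipeline } from '@xenova/transformers';

// Automatic speech recognition with a wav2vec2 checkpoint
// (assumed converted version of 'facebook/wav2vec2-base-960h')
const transcriber = await pipeline('automatic-speech-recognition', 'Xenova/wav2vec2-base-960h');
const { text } = await transcriber('https://example.com/speech.wav');

// Audio classification, e.g. MMS language identification
// (assumed converted version of 'facebook/mms-lid-126')
const classifier = await pipeline('audio-classification', 'Xenova/mms-lid-126');
const labels = await classifier('https://example.com/speech.wav', { topk: 5 });
```

Per the commit message, the `language` and `task` options are ignored for wav2vec2 models for now.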

README.md

Lines changed: 3 additions & 1 deletion
```diff
@@ -217,7 +217,7 @@
 
 | Task | ID | Description | Supported? |
 |--------------------------|----|-------------|------------|
-| [Audio Classification](https://huggingface.co/tasks/audio-classification) | `audio-classification` | Assigning a label or class to a given audio. | ❌ |
+| [Audio Classification](https://huggingface.co/tasks/audio-classification) | `audio-classification` | Assigning a label or class to a given audio. | ✅ |
 | [Audio-to-Audio](https://huggingface.co/tasks/audio-to-audio) | n/a | Generating audio from an input audio source. | ❌ |
 | [Automatic Speech Recognition](https://huggingface.co/tasks/automatic-speech-recognition) | `automatic-speech-recognition` | Transcribing a given audio into text. | ✅ |
 | [Text-to-Speech](https://huggingface.co/tasks/text-to-speech) | n/a | Generating natural-sounding speech given text input. | ❌ |
@@ -268,6 +268,7 @@
 1. **[GPTBigCode](https://huggingface.co/docs/transformers/model_doc/gpt_bigcode)** (from BigCode) released with the paper [SantaCoder: don't reach for the stars!](https://arxiv.org/abs/2301.03988) by Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, Logesh Kumar Umapathi, Carolyn Jane Anderson, Yangtian Zi, Joel Lamy Poirier, Hailey Schoelkopf, Sergey Troshin, Dmitry Abulkhanov, Manuel Romero, Michael Lappert, Francesco De Toni, Bernardo García del Río, Qian Liu, Shamik Bose, Urvashi Bhattacharyya, Terry Yue Zhuo, Ian Yu, Paulo Villegas, Marco Zocca, Sourab Mangrulkar, David Lansky, Huu Nguyen, Danish Contractor, Luis Villa, Jia Li, Dzmitry Bahdanau, Yacine Jernite, Sean Hughes, Daniel Fried, Arjun Guha, Harm de Vries, Leandro von Werra.
 1. **[M2M100](https://huggingface.co/docs/transformers/model_doc/m2m_100)** (from Facebook) released with the paper [Beyond English-Centric Multilingual Machine Translation](https://arxiv.org/abs/2010.11125) by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin.
 1. **[MarianMT](https://huggingface.co/docs/transformers/model_doc/marian)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team.
+1. **[MMS](https://huggingface.co/docs/transformers/model_doc/mms)** (from Facebook) released with the paper [Scaling Speech Technology to 1,000+ Languages](https://arxiv.org/abs/2305.13516) by Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli.
 1. **[MobileBERT](https://huggingface.co/docs/transformers/model_doc/mobilebert)** (from CMU/Google Brain) released with the paper [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://arxiv.org/abs/2004.02984) by Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou.
 1. **[MobileViT](https://huggingface.co/docs/transformers/model_doc/mobilevit)** (from Apple) released with the paper [MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer](https://arxiv.org/abs/2110.02178) by Sachin Mehta and Mohammad Rastegari.
 1. **[MPNet](https://huggingface.co/docs/transformers/model_doc/mpnet)** (from Microsoft Research) released with the paper [MPNet: Masked and Permuted Pre-training for Language Understanding](https://arxiv.org/abs/2004.09297) by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu.
@@ -278,6 +279,7 @@
 1. **[T5](https://huggingface.co/docs/transformers/model_doc/t5)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
 1. **[T5v1.1](https://huggingface.co/docs/transformers/model_doc/t5v1.1)** (from Google AI) released in the repository [google-research/text-to-text-transfer-transformer](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
 1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
+1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
 1. **[Whisper](https://huggingface.co/docs/transformers/model_doc/whisper)** (from OpenAI) released with the paper [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf) by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever.
 1. **[XLM-RoBERTa](https://huggingface.co/docs/transformers/model_doc/xlm-roberta)** (from Facebook AI), released together with the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Alexis Conneau*, Kartikay Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov.
```

docs/snippets/5_supported-tasks.snippet

Lines changed: 1 addition & 1 deletion
```diff
@@ -35,7 +35,7 @@
 
 | Task | ID | Description | Supported? |
 |--------------------------|----|-------------|------------|
-| [Audio Classification](https://huggingface.co/tasks/audio-classification) | `audio-classification` | Assigning a label or class to a given audio. | ❌ |
+| [Audio Classification](https://huggingface.co/tasks/audio-classification) | `audio-classification` | Assigning a label or class to a given audio. | ✅ |
 | [Audio-to-Audio](https://huggingface.co/tasks/audio-to-audio) | n/a | Generating audio from an input audio source. | ❌ |
 | [Automatic Speech Recognition](https://huggingface.co/tasks/automatic-speech-recognition) | `automatic-speech-recognition` | Transcribing a given audio into text. | ✅ |
 | [Text-to-Speech](https://huggingface.co/tasks/text-to-speech) | n/a | Generating natural-sounding speech given text input. | ❌ |
```

docs/snippets/6_supported-models.snippet

Lines changed: 2 additions & 0 deletions
```diff
@@ -16,6 +16,7 @@
 1. **[GPTBigCode](https://huggingface.co/docs/transformers/model_doc/gpt_bigcode)** (from BigCode) released with the paper [SantaCoder: don't reach for the stars!](https://arxiv.org/abs/2301.03988) by Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, Logesh Kumar Umapathi, Carolyn Jane Anderson, Yangtian Zi, Joel Lamy Poirier, Hailey Schoelkopf, Sergey Troshin, Dmitry Abulkhanov, Manuel Romero, Michael Lappert, Francesco De Toni, Bernardo García del Río, Qian Liu, Shamik Bose, Urvashi Bhattacharyya, Terry Yue Zhuo, Ian Yu, Paulo Villegas, Marco Zocca, Sourab Mangrulkar, David Lansky, Huu Nguyen, Danish Contractor, Luis Villa, Jia Li, Dzmitry Bahdanau, Yacine Jernite, Sean Hughes, Daniel Fried, Arjun Guha, Harm de Vries, Leandro von Werra.
 1. **[M2M100](https://huggingface.co/docs/transformers/model_doc/m2m_100)** (from Facebook) released with the paper [Beyond English-Centric Multilingual Machine Translation](https://arxiv.org/abs/2010.11125) by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin.
 1. **[MarianMT](https://huggingface.co/docs/transformers/model_doc/marian)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team.
+1. **[MMS](https://huggingface.co/docs/transformers/model_doc/mms)** (from Facebook) released with the paper [Scaling Speech Technology to 1,000+ Languages](https://arxiv.org/abs/2305.13516) by Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli.
 1. **[MobileBERT](https://huggingface.co/docs/transformers/model_doc/mobilebert)** (from CMU/Google Brain) released with the paper [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://arxiv.org/abs/2004.02984) by Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou.
 1. **[MobileViT](https://huggingface.co/docs/transformers/model_doc/mobilevit)** (from Apple) released with the paper [MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer](https://arxiv.org/abs/2110.02178) by Sachin Mehta and Mohammad Rastegari.
 1. **[MPNet](https://huggingface.co/docs/transformers/model_doc/mpnet)** (from Microsoft Research) released with the paper [MPNet: Masked and Permuted Pre-training for Language Understanding](https://arxiv.org/abs/2004.09297) by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu.
@@ -26,6 +27,7 @@
 1. **[T5](https://huggingface.co/docs/transformers/model_doc/t5)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
 1. **[T5v1.1](https://huggingface.co/docs/transformers/model_doc/t5v1.1)** (from Google AI) released in the repository [google-research/text-to-text-transfer-transformer](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
 1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
+1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
 1. **[Whisper](https://huggingface.co/docs/transformers/model_doc/whisper)** (from OpenAI) released with the paper [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf) by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever.
 1. **[XLM-RoBERTa](https://huggingface.co/docs/transformers/model_doc/xlm-roberta)** (from Facebook AI), released together with the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Alexis Conneau*, Kartikay Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov.
```

docs/source/_toctree.yml

Lines changed: 2 additions & 2 deletions
```diff
@@ -19,12 +19,12 @@
     title: Building an Electron Application
   - local: tutorials/node
     title: Server-side Inference in Node.js
-  - local: tutorials/node-audio-processing
-    title: Server-side Audio Processing in Node.js
   title: Tutorials
 - sections:
   - local: guides/private
     title: Accessing Private/Gated Models
+  - local: guides/node-audio-processing
+    title: Server-side Audio Processing in Node.js
   title: Developer Guides
 - sections:
   - local: api/transformers
```

docs/source/tutorials/node-audio-processing.mdx renamed to docs/source/guides/node-audio-processing.mdx

Lines changed: 11 additions & 3 deletions
````diff
@@ -73,9 +73,17 @@ wav.toBitDepth('32f'); // Pipeline expects input as a Float32Array
 wav.toSampleRate(16000); // Whisper expects audio with a sampling rate of 16000
 let audioData = wav.getSamples();
 if (Array.isArray(audioData)) {
-  // For this demo, if there are multiple channels for the audio file, we just select the first one.
-  // In practice, you'd probably want to convert all channels to a single channel (e.g., stereo -> mono).
-  audioData = audioData[0];
+  if (audioData.length > 1) {
+    const SCALING_FACTOR = Math.sqrt(2);
+
+    // Merge channels (into first channel to save memory)
+    for (let i = 0; i < audioData[0].length; ++i) {
+      audioData[0][i] = SCALING_FACTOR * (audioData[0][i] + audioData[1][i]) / 2;
+    }
+  }
+
+  // Select first channel
+  audioData = audioData[0];
 }
 ```
````
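A note on the new `SCALING_FACTOR = Math.sqrt(2)`: averaging two uncorrelated channels halves the signal power, and the √2 factor presumably compensates for that. For $y = \sqrt{2}\,\frac{x_1 + x_2}{2}$ with zero-mean, uncorrelated, equal-power channels,

$$\mathbb{E}[y^2] = \tfrac{1}{2}\left(\mathbb{E}[x_1^2] + \mathbb{E}[x_2^2]\right) = \mathbb{E}[x_1^2],$$

so the mono mix keeps the per-channel power (perfectly correlated channels would instead come out 3 dB hot).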

examples/node-audio-processing/index.js

Lines changed: 10 additions & 2 deletions
```diff
@@ -14,8 +14,16 @@ wav.toBitDepth('32f'); // Pipeline expects input as a Float32Array
 wav.toSampleRate(16000); // Whisper expects audio with a sampling rate of 16000
 let audioData = wav.getSamples();
 if (Array.isArray(audioData)) {
-  // For this demo, if there are multiple channels for the audio file, we just select the first one.
-  // In practice, you'd probably want to convert all channels to a single channel (e.g., stereo -> mono).
+  if (audioData.length > 1) {
+    const SCALING_FACTOR = Math.sqrt(2);
+
+    // Merge channels (into first channel to save memory)
+    for (let i = 0; i < audioData[0].length; ++i) {
+      audioData[0][i] = SCALING_FACTOR * (audioData[0][i] + audioData[1][i]) / 2;
+    }
+  }
+
+  // Select first channel
   audioData = audioData[0];
 }
```
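The same merge logic now lives in both the guide and this example; a small helper could factor it out. A hypothetical sketch, not part of the commit:

```js
// Hypothetical helper: downmix an array of per-channel Float32Arrays to mono,
// scaling by sqrt(2) so uncorrelated channels keep their per-channel power.
function downmixToMono(channels, scalingFactor = Math.sqrt(2)) {
    if (channels.length === 1) return channels[0];
    const [left, right] = channels;
    const mono = new Float32Array(left.length);
    for (let i = 0; i < left.length; ++i) {
        mono[i] = scalingFactor * (left[i] + right[i]) / 2;
    }
    return mono;
}
```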

scripts/convert.py

Lines changed: 19 additions & 2 deletions
```diff
@@ -32,6 +32,10 @@
     }
 }
 
+MODELS_WITHOUT_TOKENIZERS = [
+    'wav2vec2'
+]
+
 
 @dataclass
 class ConversionArguments:
@@ -212,12 +216,16 @@ def main():
 
     tokenizer = None
     try:
-        # Save tokenizer
+        # Load tokenizer
         tokenizer = AutoTokenizer.from_pretrained(model_id)
 
     except KeyError:
         pass # No Tokenizer
 
+    except Exception as e:
+        if config.model_type not in MODELS_WITHOUT_TOKENIZERS:
+            raise e
+
     export_kwargs = dict(
         model_name_or_path=model_id,
         output=output_model_folder,
@@ -233,7 +241,7 @@ def main():
         tokenizer_json = generate_tokenizer_json(model_id, tokenizer)
 
         with open(os.path.join(output_model_folder, 'tokenizer.json'), 'w', encoding='utf-8') as fp:
-            json.dump(tokenizer_json, fp)
+            json.dump(tokenizer_json, fp, indent=4)
 
     elif config.model_type == 'whisper':
         if conv_args.output_attentions:
@@ -242,6 +250,15 @@ def main():
         export_kwargs.update(
             **get_main_export_kwargs(config, "automatic-speech-recognition")
         )
+
+    elif config.model_type == 'wav2vec2':
+        if tokenizer is not None:
+            from .extra.wav2vec2 import generate_tokenizer_json
+            tokenizer_json = generate_tokenizer_json(tokenizer)
+
+            with open(os.path.join(output_model_folder, 'tokenizer.json'), 'w', encoding='utf-8') as fp:
+                json.dump(tokenizer_json, fp, indent=4)
+
     else:
         pass # TODO
```

scripts/extra/wav2vec2.py

Lines changed: 58 additions & 0 deletions
```python
def generate_tokenizer_json(tokenizer):
    vocab = tokenizer.vocab

    special_tokens_vocab = vocab
    if "<pad>" not in tokenizer.vocab:
        # For MMS tokenizers, the vocab is of the form:
        # {
        #     language_id: { language_vocab }
        # }
        # So, to get the list of special tokens, we just get the english vocab
        special_tokens_vocab = vocab['eng']

    tokenizer_json = {
        "version": "1.0",
        "truncation": None,
        "padding": None,
        "added_tokens": [
            {
                "id": v,
                "content": k,
                "single_word": False,
                "lstrip": False,
                "rstrip": False,
                "normalized": False,
                "special": True
            }
            for k, v in special_tokens_vocab.items()
            if k.startswith('<') and k.endswith('>')
        ],
        "normalizer": {
            "type": "Replace",
            "pattern": {
                "String": " "
            },
            "content": "|"
        },
        "pre_tokenizer": {
            "type": "Split",
            "pattern": {
                "Regex": ""
            },
            "behavior": "Isolated",
            "invert": False
        },
        "post_processor": None,
        "decoder": {
            "type": "CTC",
            "pad_token": "<pad>",
            "word_delimiter_token": "|",
            "cleanup": True
        },
        "model": {
            "vocab": vocab
        }
    }

    return tokenizer_json
```
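The `decoder` section above is what drives transcription on the JavaScript side: a CTC decoder that collapses repeated predictions, drops the `<pad>` (blank) token, and maps the word delimiter `|` back to a space. A greedy-decoding sketch of that behavior, assuming a flat (non-MMS) vocab — an illustration, not the library's actual implementation:

```js
// Greedy CTC decode: collapse repeats first, then drop the blank token,
// then turn the word delimiter back into spaces.
// `vocab` maps token string -> id, as in the generated tokenizer.json.
function ctcGreedyDecode(tokenIds, vocab, padToken = '<pad>', wordDelimiter = '|') {
    const idToToken = new Map(Object.entries(vocab).map(([token, id]) => [id, token]));
    const chars = [];
    let prev = null;
    for (const id of tokenIds) {
        if (id !== prev) {
            const token = idToToken.get(id);
            if (token !== undefined && token !== padToken) chars.push(token);
        }
        prev = id;
    }
    return chars.join('').replaceAll(wordDelimiter, ' ').trim();
}
```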

scripts/supported_models.py

Lines changed: 22 additions & 0 deletions
```diff
@@ -226,6 +226,28 @@
         'facebook/dino-vitb8',
         'facebook/dino-vits16',
     ],
+    'wav2vec2': [
+        # feature extraction # NOTE: requires --task feature-extraction
+        'facebook/mms-300m',
+        'facebook/mms-1b',
+
+        # audio classification
+        'alefiury/wav2vec2-large-xlsr-53-gender-recognition-librispeech',
+        'superb/wav2vec2-base-superb-ks',
+        'facebook/mms-lid-126',
+        'facebook/mms-lid-256',
+        'facebook/mms-lid-512',
+        'facebook/mms-lid-1024',
+        'facebook/mms-lid-2048',
+        'facebook/mms-lid-4017',
+
+        # speech recognition
+        'jonatasgrosman/wav2vec2-large-xlsr-53-english',
+        'facebook/wav2vec2-base-960h',
+        'facebook/mms-1b-l1107',
+        'facebook/mms-1b-all',
+        'facebook/mms-1b-fl102',
+    ],
     'whisper': [
         'openai/whisper-tiny',
         'openai/whisper-tiny.en',
```
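The first two entries are feature-extraction checkpoints; per the commit notes, feature extraction works here even though it was broken for MMS in the Python library at the time (huggingface/transformers#25485). A sketch of what this might look like with the new `Wav2Vec2Model` — the model ID and the exact `AutoProcessor`/`read_audio` usage are assumptions on my part:

```js
import { AutoProcessor, Wav2Vec2Model, read_audio } from '@xenova/transformers';

// Assumed converted version of 'facebook/mms-300m', which the list above
// notes must be exported with `--task feature-extraction`.
const processor = await AutoProcessor.from_pretrained('Xenova/mms-300m');
const model = await Wav2Vec2Model.from_pretrained('Xenova/mms-300m');

const audio = await read_audio('https://example.com/speech.wav', 16000); // 16 kHz mono Float32Array
const inputs = await processor(audio);
const { last_hidden_state } = await model(inputs); // shape [1, numFrames, hiddenSize]
```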
