
Commit d479953

[WIP] Add MMS and Wav2Vec2 models (Closes huggingface#209) (huggingface#220)
* Add example `wav2vec2` models
* Add support for `CTCDecoder` and `Wav2Vec2CTCTokenizer`
* Generate tokenizer.json files for wav2vec2 models
* Fix wav2vec2 custom tokenizer generation
* Implement wav2vec2 automatic-speech-recognition
* Add `Wav2Vec2` as a supported architecture
* Update README.md
* Update generate_tests.py
* Ignore invalid tests
* Update supported wav2vec2 models
* Update supported_models.py
* Simplify pipeline construction
* Implement basic audio classification pipeline
* Update default topk value for audio classification pipeline
* Add example usage for the audio classification pipeline
* Move `loadAudio` to utils file
* Add audio classification unit test
* Add wav2vec2 ASR unit test
* Improve generated wav2vec2 tokenizer json
* Update supported_models.py
* Allow `added_tokens_regex` to be null
* Support exporting MMS vocabs
* Support nested vocabularies
* Update supported tasks and models
* Add warnings that `language` and `task` are ignored for wav2vec2 models (will add in future)
* Mark internal methods as private
* Add typing to audio variable
* Update node-audio-processing.mdx
* Move node-audio-processing to guides
* Update table of contents
* Add example code for performing feature extraction w/ `Wav2Vec2Model`. NOTE: feature extraction of MMS models is currently broken in the Python library, but it works correctly here. See huggingface/transformers#25485 for more info.
* Refactor `Pipeline` class params
* Fix `pipeline` function
* Fix typo in `pipeline` JSDoc
* Fix second typo
1 parent 060ac83 commit d479953

18 files changed: +791 −147 lines
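The headline user-facing change is wav2vec2 support in the `pipeline` API: automatic speech recognition via CTC decoding, plus a new audio classification pipeline. A minimal sketch of how these might be invoked — the `Xenova/*` model IDs and URLs are illustrative assumptions, not part of this commit:

```js
import { pipeline } from '@xenova/transformers';

// Automatic speech recognition with a wav2vec2 checkpoint
// (assumed converted version of 'facebook/wav2vec2-base-960h')
const transcriber = await pipeline('automatic-speech-recognition', 'Xenova/wav2vec2-base-960h');
const { text } = await transcriber('https://example.com/speech.wav');

// Audio classification, e.g. MMS language identification
// (assumed converted version of 'facebook/mms-lid-126')
const classifier = await pipeline('audio-classification', 'Xenova/mms-lid-126');
const labels = await classifier('https://example.com/speech.wav', { topk: 5 });
```

Per the commit message, the `language` and `task` options are ignored for wav2vec2 models for now.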

README.md

Lines changed: 3 additions & 1 deletion
```diff
@@ -217,7 +217,7 @@
 
 | Task | ID | Description | Supported? |
 |--------------------------|----|-------------|------------|
-| [Audio Classification](https://huggingface.co/tasks/audio-classification) | `audio-classification` | Assigning a label or class to a given audio. | ❌ |
+| [Audio Classification](https://huggingface.co/tasks/audio-classification) | `audio-classification` | Assigning a label or class to a given audio. | ✅ |
 | [Audio-to-Audio](https://huggingface.co/tasks/audio-to-audio) | n/a | Generating audio from an input audio source. | ❌ |
 | [Automatic Speech Recognition](https://huggingface.co/tasks/automatic-speech-recognition) | `automatic-speech-recognition` | Transcribing a given audio into text. | ✅ |
 | [Text-to-Speech](https://huggingface.co/tasks/text-to-speech) | n/a | Generating natural-sounding speech given text input. | ❌ |
@@ -268,6 +268,7 @@
 1. **[GPTBigCode](https://huggingface.co/docs/transformers/model_doc/gpt_bigcode)** (from BigCode) released with the paper [SantaCoder: don't reach for the stars!](https://arxiv.org/abs/2301.03988) by Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, Logesh Kumar Umapathi, Carolyn Jane Anderson, Yangtian Zi, Joel Lamy Poirier, Hailey Schoelkopf, Sergey Troshin, Dmitry Abulkhanov, Manuel Romero, Michael Lappert, Francesco De Toni, Bernardo García del Río, Qian Liu, Shamik Bose, Urvashi Bhattacharyya, Terry Yue Zhuo, Ian Yu, Paulo Villegas, Marco Zocca, Sourab Mangrulkar, David Lansky, Huu Nguyen, Danish Contractor, Luis Villa, Jia Li, Dzmitry Bahdanau, Yacine Jernite, Sean Hughes, Daniel Fried, Arjun Guha, Harm de Vries, Leandro von Werra.
 1. **[M2M100](https://huggingface.co/docs/transformers/model_doc/m2m_100)** (from Facebook) released with the paper [Beyond English-Centric Multilingual Machine Translation](https://arxiv.org/abs/2010.11125) by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin.
 1. **[MarianMT](https://huggingface.co/docs/transformers/model_doc/marian)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team.
+1. **[MMS](https://huggingface.co/docs/transformers/model_doc/mms)** (from Facebook) released with the paper [Scaling Speech Technology to 1,000+ Languages](https://arxiv.org/abs/2305.13516) by Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli.
 1. **[MobileBERT](https://huggingface.co/docs/transformers/model_doc/mobilebert)** (from CMU/Google Brain) released with the paper [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://arxiv.org/abs/2004.02984) by Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou.
 1. **[MobileViT](https://huggingface.co/docs/transformers/model_doc/mobilevit)** (from Apple) released with the paper [MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer](https://arxiv.org/abs/2110.02178) by Sachin Mehta and Mohammad Rastegari.
 1. **[MPNet](https://huggingface.co/docs/transformers/model_doc/mpnet)** (from Microsoft Research) released with the paper [MPNet: Masked and Permuted Pre-training for Language Understanding](https://arxiv.org/abs/2004.09297) by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu.
@@ -278,6 +279,7 @@
 1. **[T5](https://huggingface.co/docs/transformers/model_doc/t5)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
 1. **[T5v1.1](https://huggingface.co/docs/transformers/model_doc/t5v1.1)** (from Google AI) released in the repository [google-research/text-to-text-transfer-transformer](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
 1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
+1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
 1. **[Whisper](https://huggingface.co/docs/transformers/model_doc/whisper)** (from OpenAI) released with the paper [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf) by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever.
 1. **[XLM-RoBERTa](https://huggingface.co/docs/transformers/model_doc/xlm-roberta)** (from Facebook AI), released together with the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Alexis Conneau*, Kartikay Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov.
```

docs/snippets/5_supported-tasks.snippet

Lines changed: 1 addition & 1 deletion
```diff
@@ -35,7 +35,7 @@
 
 | Task | ID | Description | Supported? |
 |--------------------------|----|-------------|------------|
-| [Audio Classification](https://huggingface.co/tasks/audio-classification) | `audio-classification` | Assigning a label or class to a given audio. | ❌ |
+| [Audio Classification](https://huggingface.co/tasks/audio-classification) | `audio-classification` | Assigning a label or class to a given audio. | ✅ |
 | [Audio-to-Audio](https://huggingface.co/tasks/audio-to-audio) | n/a | Generating audio from an input audio source. | ❌ |
 | [Automatic Speech Recognition](https://huggingface.co/tasks/automatic-speech-recognition) | `automatic-speech-recognition` | Transcribing a given audio into text. | ✅ |
 | [Text-to-Speech](https://huggingface.co/tasks/text-to-speech) | n/a | Generating natural-sounding speech given text input. | ❌ |
```

docs/snippets/6_supported-models.snippet

Lines changed: 2 additions & 0 deletions
```diff
@@ -16,6 +16,7 @@
 1. **[GPTBigCode](https://huggingface.co/docs/transformers/model_doc/gpt_bigcode)** (from BigCode) released with the paper [SantaCoder: don't reach for the stars!](https://arxiv.org/abs/2301.03988) by Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, Logesh Kumar Umapathi, Carolyn Jane Anderson, Yangtian Zi, Joel Lamy Poirier, Hailey Schoelkopf, Sergey Troshin, Dmitry Abulkhanov, Manuel Romero, Michael Lappert, Francesco De Toni, Bernardo García del Río, Qian Liu, Shamik Bose, Urvashi Bhattacharyya, Terry Yue Zhuo, Ian Yu, Paulo Villegas, Marco Zocca, Sourab Mangrulkar, David Lansky, Huu Nguyen, Danish Contractor, Luis Villa, Jia Li, Dzmitry Bahdanau, Yacine Jernite, Sean Hughes, Daniel Fried, Arjun Guha, Harm de Vries, Leandro von Werra.
 1. **[M2M100](https://huggingface.co/docs/transformers/model_doc/m2m_100)** (from Facebook) released with the paper [Beyond English-Centric Multilingual Machine Translation](https://arxiv.org/abs/2010.11125) by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin.
 1. **[MarianMT](https://huggingface.co/docs/transformers/model_doc/marian)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team.
+1. **[MMS](https://huggingface.co/docs/transformers/model_doc/mms)** (from Facebook) released with the paper [Scaling Speech Technology to 1,000+ Languages](https://arxiv.org/abs/2305.13516) by Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli.
 1. **[MobileBERT](https://huggingface.co/docs/transformers/model_doc/mobilebert)** (from CMU/Google Brain) released with the paper [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://arxiv.org/abs/2004.02984) by Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou.
 1. **[MobileViT](https://huggingface.co/docs/transformers/model_doc/mobilevit)** (from Apple) released with the paper [MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer](https://arxiv.org/abs/2110.02178) by Sachin Mehta and Mohammad Rastegari.
 1. **[MPNet](https://huggingface.co/docs/transformers/model_doc/mpnet)** (from Microsoft Research) released with the paper [MPNet: Masked and Permuted Pre-training for Language Understanding](https://arxiv.org/abs/2004.09297) by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu.
@@ -26,6 +27,7 @@
 1. **[T5](https://huggingface.co/docs/transformers/model_doc/t5)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
 1. **[T5v1.1](https://huggingface.co/docs/transformers/model_doc/t5v1.1)** (from Google AI) released in the repository [google-research/text-to-text-transfer-transformer](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
 1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
+1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
 1. **[Whisper](https://huggingface.co/docs/transformers/model_doc/whisper)** (from OpenAI) released with the paper [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf) by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever.
 1. **[XLM-RoBERTa](https://huggingface.co/docs/transformers/model_doc/xlm-roberta)** (from Facebook AI), released together with the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Alexis Conneau*, Kartikay Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov.
```

docs/source/_toctree.yml

Lines changed: 2 additions & 2 deletions
```diff
@@ -19,12 +19,12 @@
     title: Building an Electron Application
   - local: tutorials/node
     title: Server-side Inference in Node.js
-  - local: tutorials/node-audio-processing
-    title: Server-side Audio Processing in Node.js
   title: Tutorials
 - sections:
   - local: guides/private
     title: Accessing Private/Gated Models
+  - local: guides/node-audio-processing
+    title: Server-side Audio Processing in Node.js
   title: Developer Guides
 - sections:
   - local: api/transformers
```

docs/source/tutorials/node-audio-processing.mdx renamed to docs/source/guides/node-audio-processing.mdx

Lines changed: 11 additions & 3 deletions
````diff
@@ -73,9 +73,17 @@ wav.toBitDepth('32f'); // Pipeline expects input as a Float32Array
 wav.toSampleRate(16000); // Whisper expects audio with a sampling rate of 16000
 let audioData = wav.getSamples();
 if (Array.isArray(audioData)) {
-  // For this demo, if there are multiple channels for the audio file, we just select the first one.
-  // In practice, you'd probably want to convert all channels to a single channel (e.g., stereo -> mono).
-  audioData = audioData[0];
+  if (audioData.length > 1) {
+    const SCALING_FACTOR = Math.sqrt(2);
+
+    // Merge channels (into first channel to save memory)
+    for (let i = 0; i < audioData[0].length; ++i) {
+      audioData[0][i] = SCALING_FACTOR * (audioData[0][i] + audioData[1][i]) / 2;
+    }
+  }
+
+  // Select first channel
+  audioData = audioData[0];
 }
 ```
````
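A note on the new `SCALING_FACTOR = Math.sqrt(2)`: averaging two uncorrelated channels halves the signal power, and the √2 factor presumably compensates for that. For $y = \sqrt{2}\,\frac{x_1 + x_2}{2}$ with zero-mean, uncorrelated, equal-power channels,

$$\mathbb{E}[y^2] = \tfrac{1}{2}\left(\mathbb{E}[x_1^2] + \mathbb{E}[x_2^2]\right) = \mathbb{E}[x_1^2],$$

so the mono mix keeps the per-channel power (perfectly correlated channels would instead come out 3 dB hot).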

examples/node-audio-processing/index.js

Lines changed: 10 additions & 2 deletions
```diff
@@ -14,8 +14,16 @@ wav.toBitDepth('32f'); // Pipeline expects input as a Float32Array
 wav.toSampleRate(16000); // Whisper expects audio with a sampling rate of 16000
 let audioData = wav.getSamples();
 if (Array.isArray(audioData)) {
-  // For this demo, if there are multiple channels for the audio file, we just select the first one.
-  // In practice, you'd probably want to convert all channels to a single channel (e.g., stereo -> mono).
+  if (audioData.length > 1) {
+    const SCALING_FACTOR = Math.sqrt(2);
+
+    // Merge channels (into first channel to save memory)
+    for (let i = 0; i < audioData[0].length; ++i) {
+      audioData[0][i] = SCALING_FACTOR * (audioData[0][i] + audioData[1][i]) / 2;
+    }
+  }
+
+  // Select first channel
   audioData = audioData[0];
 }
```
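The same merge logic now lives in both the guide and this example; a small helper could factor it out. A hypothetical sketch, not part of the commit:

```js
// Hypothetical helper: downmix an array of per-channel Float32Arrays to mono,
// scaling by sqrt(2) so uncorrelated channels keep their per-channel power.
function downmixToMono(channels, scalingFactor = Math.sqrt(2)) {
    if (channels.length === 1) return channels[0];
    const [left, right] = channels;
    const mono = new Float32Array(left.length);
    for (let i = 0; i < left.length; ++i) {
        mono[i] = scalingFactor * (left[i] + right[i]) / 2;
    }
    return mono;
}
```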

scripts/convert.py

Lines changed: 19 additions & 2 deletions
```diff
@@ -32,6 +32,10 @@
     }
 }
 
+MODELS_WITHOUT_TOKENIZERS = [
+    'wav2vec2'
+]
+
 
 @dataclass
 class ConversionArguments:
@@ -212,12 +216,16 @@ def main():
 
     tokenizer = None
     try:
-        # Save tokenizer
+        # Load tokenizer
         tokenizer = AutoTokenizer.from_pretrained(model_id)
 
     except KeyError:
         pass # No Tokenizer
 
+    except Exception as e:
+        if config.model_type not in MODELS_WITHOUT_TOKENIZERS:
+            raise e
+
     export_kwargs = dict(
         model_name_or_path=model_id,
         output=output_model_folder,
@@ -233,7 +241,7 @@ def main():
         tokenizer_json = generate_tokenizer_json(model_id, tokenizer)
 
         with open(os.path.join(output_model_folder, 'tokenizer.json'), 'w', encoding='utf-8') as fp:
-            json.dump(tokenizer_json, fp)
+            json.dump(tokenizer_json, fp, indent=4)
 
     elif config.model_type == 'whisper':
         if conv_args.output_attentions:
@@ -242,6 +250,15 @@ def main():
         export_kwargs.update(
             **get_main_export_kwargs(config, "automatic-speech-recognition")
         )
+
+    elif config.model_type == 'wav2vec2':
+        if tokenizer is not None:
+            from .extra.wav2vec2 import generate_tokenizer_json
+            tokenizer_json = generate_tokenizer_json(tokenizer)
+
+            with open(os.path.join(output_model_folder, 'tokenizer.json'), 'w', encoding='utf-8') as fp:
+                json.dump(tokenizer_json, fp, indent=4)
+
     else:
         pass # TODO
```

scripts/extra/wav2vec2.py

Lines changed: 58 additions & 0 deletions
```python
def generate_tokenizer_json(tokenizer):
    vocab = tokenizer.vocab

    special_tokens_vocab = vocab
    if "<pad>" not in tokenizer.vocab:
        # For MMS tokenizers, the vocab is of the form:
        # {
        #     language_id: { language_vocab }
        # }
        # So, to get the list of special tokens, we just get the english vocab
        special_tokens_vocab = vocab['eng']

    tokenizer_json = {
        "version": "1.0",
        "truncation": None,
        "padding": None,
        "added_tokens": [
            {
                "id": v,
                "content": k,
                "single_word": False,
                "lstrip": False,
                "rstrip": False,
                "normalized": False,
                "special": True
            }
            for k, v in special_tokens_vocab.items()
            if k.startswith('<') and k.endswith('>')
        ],
        "normalizer": {
            "type": "Replace",
            "pattern": {
                "String": " "
            },
            "content": "|"
        },
        "pre_tokenizer": {
            "type": "Split",
            "pattern": {
                "Regex": ""
            },
            "behavior": "Isolated",
            "invert": False
        },
        "post_processor": None,
        "decoder": {
            "type": "CTC",
            "pad_token": "<pad>",
            "word_delimiter_token": "|",
            "cleanup": True
        },
        "model": {
            "vocab": vocab
        }
    }

    return tokenizer_json
```
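The `decoder` section above is what drives transcription on the JavaScript side: a CTC decoder that collapses repeated predictions, drops the `<pad>` (blank) token, and maps the word delimiter `|` back to a space. A greedy-decoding sketch of that behavior, assuming a flat (non-MMS) vocab — an illustration, not the library's actual implementation:

```js
// Greedy CTC decode: collapse repeats first, then drop the blank token,
// then turn the word delimiter back into spaces.
// `vocab` maps token string -> id, as in the generated tokenizer.json.
function ctcGreedyDecode(tokenIds, vocab, padToken = '<pad>', wordDelimiter = '|') {
    const idToToken = new Map(Object.entries(vocab).map(([token, id]) => [id, token]));
    const chars = [];
    let prev = null;
    for (const id of tokenIds) {
        if (id !== prev) {
            const token = idToToken.get(id);
            if (token !== undefined && token !== padToken) chars.push(token);
        }
        prev = id;
    }
    return chars.join('').replaceAll(wordDelimiter, ' ').trim();
}
```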

scripts/supported_models.py

Lines changed: 22 additions & 0 deletions
```diff
@@ -226,6 +226,28 @@
         'facebook/dino-vitb8',
         'facebook/dino-vits16',
     ],
+    'wav2vec2': [
+        # feature extraction # NOTE: requires --task feature-extraction
+        'facebook/mms-300m',
+        'facebook/mms-1b',
+
+        # audio classification
+        'alefiury/wav2vec2-large-xlsr-53-gender-recognition-librispeech',
+        'superb/wav2vec2-base-superb-ks',
+        'facebook/mms-lid-126',
+        'facebook/mms-lid-256',
+        'facebook/mms-lid-512',
+        'facebook/mms-lid-1024',
+        'facebook/mms-lid-2048',
+        'facebook/mms-lid-4017',
+
+        # speech recognition
+        'jonatasgrosman/wav2vec2-large-xlsr-53-english',
+        'facebook/wav2vec2-base-960h',
+        'facebook/mms-1b-l1107',
+        'facebook/mms-1b-all',
+        'facebook/mms-1b-fl102',
+    ],
     'whisper': [
         'openai/whisper-tiny',
         'openai/whisper-tiny.en',
```
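The first two entries are feature-extraction checkpoints; per the commit notes, feature extraction works here even though it was broken for MMS in the Python library at the time (huggingface/transformers#25485). A sketch of what this might look like with the new `Wav2Vec2Model` — the model ID and the exact `AutoProcessor`/`read_audio` usage are assumptions on my part:

```js
import { AutoProcessor, Wav2Vec2Model, read_audio } from '@xenova/transformers';

// Assumed converted version of 'facebook/mms-300m', which the list above
// notes must be exported with `--task feature-extraction`.
const processor = await AutoProcessor.from_pretrained('Xenova/mms-300m');
const model = await Wav2Vec2Model.from_pretrained('Xenova/mms-300m');

const audio = await read_audio('https://example.com/speech.wav', 16000); // 16 kHz mono Float32Array
const inputs = await processor(audio);
const { last_hidden_state } = await model(inputs); // shape [1, numFrames, hiddenSize]
```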
