Fine-tuning Whisper #759

ehsantaati · 2022-12-28T15:31:21Z

ehsantaati
Dec 28, 2022

I am trying to fine-tune Whisper to improve the WER on simulated telephone recordings in English. I am using the "small" model and a dataset of around 32 hours of English audio, with clips averaging 8 seconds.

I unfreeze the decoder's attention blocks, starting from the last block. The fine-tuned model performs well on the validation and test sets (just by fine-tuning the last blocks), but I get a poor WER on longer speech (for example, a 1-minute recording). I use the transcribe method for the longer audio.

Any suggestions for improving the WER on longer speech?

DavraYoung · 2022-12-29T11:25:50Z

DavraYoung
Dec 29, 2022

I would suggest splitting your audio into smaller chunks, because the Whisper model itself cannot process more than 30 s of audio at a time.
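A minimal sketch of what such a split could look like, assuming 16 kHz audio and a 5 s overlap between neighbouring windows (the overlap value and the function name are illustrative, not something Whisper prescribes). It only computes the window boundaries; merging the overlapping transcripts is left out.

```python
from typing import List, Tuple

SAMPLE_RATE = 16_000  # Whisper expects 16 kHz audio


def chunk_bounds(n_samples: int, chunk_s: float = 30.0, overlap_s: float = 5.0,
                 sample_rate: int = SAMPLE_RATE) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of overlapping windows covering the audio."""
    chunk = int(chunk_s * sample_rate)
    step = chunk - int(overlap_s * sample_rate)
    if step <= 0:
        raise ValueError("overlap_s must be smaller than chunk_s")
    bounds = []
    start = 0
    while start < n_samples:
        end = min(start + chunk, n_samples)
        bounds.append((start, end))
        if end == n_samples:
            break
        start += step
    return bounds


# A 70-second recording split into 30 s windows that overlap by 5 s.
for start, end in chunk_bounds(70 * SAMPLE_RATE):
    print(f"{start / SAMPLE_RATE:.0f}s -> {end / SAMPLE_RATE:.0f}s")
```

Cutting on silence rather than at fixed offsets would probably avoid splitting words, but that needs a voice activity detector and is not shown here.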

3 replies

DavraYoung Dec 29, 2022

By the way, how did you fine-tune Whisper? With Hugging Face Transformers?

KlentyAbuBakker Jan 20, 2023

@DavraYoung would splitting the audio manually into 30s chunks change the tone of the transcription? And if we split the audio manually, how would the model handle the overlap between chunks?

youssefanjjar Aug 22, 2024

try to speed it up

sanchit-gandhi · 2023-01-20T11:30:43Z

sanchit-gandhi
Jan 20, 2023

Hey @ehsantaati! Cool to see that you're fine-tuning Whisper for telephone recordings. This Colab nicely explains how you can use Transformers' pipeline method to transcribe audio samples > 30s: https://colab.research.google.com/drive/1l290cRv4RdvuLNlSeo9WexByHaNWs3s3?usp=sharing

You can play around a bit with the chunk_length_s parameter and find a value that works best for your data (setting it to 30s works well in general, might be worth trying something lower given your training data is avg 8s duration).
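A rough sketch of how trying several chunk lengths might be set up. The helper names and the "openai/whisper-small" checkpoint are placeholders (swap in the fine-tuned model), and the candidate values 10/15/30 are just examples to compare by WER on some held-out long recordings.

```python
def asr_pipeline_kwargs(model_id: str = "openai/whisper-small",
                        chunk_length_s: float = 30.0) -> dict:
    """Keyword arguments for transformers.pipeline(); model_id is a placeholder."""
    return {
        "task": "automatic-speech-recognition",
        "model": model_id,
        "chunk_length_s": chunk_length_s,
    }


def transcribe_long(path: str, **kwargs) -> str:
    """Transcribe a long audio file with chunking (needs transformers, torch and ffmpeg)."""
    from transformers import pipeline  # imported lazily so the helper above has no dependencies

    pipe = pipeline(**asr_pipeline_kwargs(**kwargs))
    return pipe(path)["text"]


# Candidate settings to compare on a held-out set of long recordings.
candidates = [asr_pipeline_kwargs(chunk_length_s=s) for s in (10, 15, 30)]
print([c["chunk_length_s"] for c in candidates])
```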

1 reply

Ace-myu Jan 18, 2024

Hello sir,

Is there a way to enable "condition_on_previous_text" on Transformer Whisper? I have encountered hallucinations and repeated text issue and I'd like to test out this parameter.
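As far as I know, recent Transformers releases expose a similar switch as condition_on_prev_tokens on the Whisper generate method, used together with sequential long-form decoding. The sketch below assumes such a release; the threshold values are the openai-whisper defaults, and the function names and checkpoint are placeholders.

```python
def long_form_generate_kwargs(condition_on_prev: bool = True) -> dict:
    """Options for WhisperForConditionalGeneration.generate() in long-form mode."""
    return {
        "condition_on_prev_tokens": condition_on_prev,
        "return_timestamps": True,  # long-form decoding relies on timestamp tokens
        # Fallback heuristics described in the Whisper paper (openai-whisper defaults).
        "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
        "compression_ratio_threshold": 2.4,
        "logprob_threshold": -1.0,
        "no_speech_threshold": 0.6,
    }


def transcribe_sequential(audio_array, model_id: str = "openai/whisper-small") -> str:
    """Sequential long-form transcription of a 16 kHz waveform (needs transformers and torch)."""
    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    processor = WhisperProcessor.from_pretrained(model_id)
    model = WhisperForConditionalGeneration.from_pretrained(model_id)
    # Keep the full audio instead of truncating it to a single 30 s window.
    inputs = processor(audio_array, sampling_rate=16_000, return_tensors="pt",
                       truncation=False, padding="longest", return_attention_mask=True)
    ids = model.generate(**inputs, **long_form_generate_kwargs())
    return processor.batch_decode(ids, skip_special_tokens=True)[0]


print(long_form_generate_kwargs(condition_on_prev=False)["condition_on_prev_tokens"])
```

Comparing runs with the flag on and off on the same files is probably the quickest way to see whether it helps or worsens the repetition.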

Bumicom · 2023-01-20T15:11:18Z

Bumicom
Jan 20, 2023

@sanchit-gandhi
In the Whisper paper, section "4.5. Strategies for Reliable Long-form Transcription", I see the researchers use a much more sophisticated strategy than a simple 30 s chunk length. Do you know how to enable timecodes in the Hugging Face version of the model?

10 replies

sanchit-gandhi Feb 1, 2023

For timestamps, simply set return_timestamps=True, see https://huggingface.co/openai/whisper-tiny#long-form-transcription:

out = pipe(audio, return_timestamps=True)["chunks"]
print(out)

For translation, we simply need to set task="translate" when we call pipe:

out = pipe(audio, return_timestamps=True, task="translate")["chunks"]
print(out)

Just make sure you have installed transformers from main for this to work:

pip install git+https://github.com/huggingface/transformers
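For reading the result, here is a small sketch that prints the returned chunks one per line. It assumes each chunk is a dict with a "timestamp" pair and a "text" field; the example data is hand-written, not real model output, and format_chunks is a made-up helper name.

```python
def format_chunks(chunks):
    """Render pipeline 'chunks' ([{'timestamp': (start, end), 'text': ...}]) as text lines."""
    lines = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        end_label = f"{end:.2f}" if end is not None else "?"  # the last end time can be missing
        lines.append(f"[{start:.2f} -> {end_label}] {chunk['text'].strip()}")
    return lines


# Hand-written example in the shape the pipeline returns; not real model output.
example = [
    {"timestamp": (0.0, 4.5), "text": " Hello there."},
    {"timestamp": (4.5, None), "text": " Goodbye."},
]
print("\n".join(format_chunks(example)))
```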

Bumicom Feb 2, 2023

@sanchit-gandhi
Thank you for the information. Using chunk_length_s=30 resulted in repeated words. Using a shorter chunk length seems to work. I will do some additional testing.

sanchit-gandhi Feb 2, 2023

Lovely! Glad to hear it!

dgoryeo Feb 2, 2023

Hi @sanchit-gandhi, is chunk length a feature of the HF pipeline, or is it also possible in the standard Whisper transcribe? I haven't used HF in the past and am wondering if I should take that step for the additional benefits/features.

sanchit-gandhi Feb 10, 2023

It's a hyperparameter in OpenAI's implementation, but it's hardcoded to 30s and not exposed to the user:

whisper/whisper/audio.py

Line 17 in 7858aa9

CHUNK_LENGTH = 30

With HF's, we allow the user to control the chunk length. We've found that chunk_length_s=30 tends to work best. See https://huggingface.co/openai/whisper-medium#long-form-transcription for more details.
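To make the 30 s figure concrete, the arithmetic behind the window size can be sketched as follows. CHUNK_LENGTH is the constant quoted above; the SAMPLE_RATE and HOP_LENGTH values are as I recall them from the same file, so treat them as assumptions. pad_or_trim here is a simplified list-based stand-in, not the library function.

```python
SAMPLE_RATE = 16_000   # Hz (assumed value)
HOP_LENGTH = 160       # samples between spectrogram frames (assumed value)
CHUNK_LENGTH = 30      # seconds, the hardcoded constant

N_SAMPLES = CHUNK_LENGTH * SAMPLE_RATE   # samples in one window
N_FRAMES = N_SAMPLES // HOP_LENGTH       # mel frames in one window


def pad_or_trim(samples, length=N_SAMPLES):
    """Zero-pad or cut a list of samples so it fills exactly one window."""
    return samples[:length] + [0.0] * max(0, length - len(samples))


print(N_SAMPLES, N_FRAMES)
```

So an 8 s training clip is mostly padding by the time it reaches the encoder, which may be part of why shorter chunk lengths behave differently for a model fine-tuned on short clips.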

marcvie · 2023-02-12T14:07:32Z

marcvie
Feb 12, 2023

Whisper (large model) transcribes drug names wrong quite often when used to transcribe medical audio files. Is there some way that I can add say a medical dictionary (in text format) or any other way to improve the accuracy of drug names?

1 reply

sanchit-gandhi Feb 17, 2023

Does this help? https://discuss.huggingface.co/t/adding-custom-vocabularies-on-whisper/29311/2?u=nbroad
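One option that needs no fine-tuning is to pass the expected terms as a prompt. The sketch below uses the initial_prompt argument of openai-whisper's transcribe; it biases the decoder towards those spellings but does not guarantee them, and it is not necessarily what the linked thread recommends. build_prompt, the 200-character budget and the drug list are all illustrative.

```python
def build_prompt(terms, max_chars=200):
    """Join domain terms into a short prompt, stopping before max_chars is exceeded."""
    picked = []
    for term in terms:
        candidate = ", ".join(picked + [term])
        if len(candidate) > max_chars:
            break
        picked.append(term)
    return ", ".join(picked)


def transcribe_with_terms(path, terms, model_name="large"):
    """Transcribe with a vocabulary hint (needs the openai-whisper package)."""
    import whisper  # imported lazily; only needed for real transcription

    model = whisper.load_model(model_name)
    return model.transcribe(path, initial_prompt=build_prompt(terms))["text"]


drug_names = ["metformin", "atorvastatin", "levothyroxine"]  # illustrative list
print(build_prompt(drug_names))
```

The prompt window is limited, so a whole medical dictionary will not fit; picking the terms most likely to occur in a given recording is probably the practical route.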

marcvie · 2023-02-17T16:04:25Z

marcvie
Feb 17, 2023

Thanks. Will give it a try.


1 reply

toanhuynhnguyen Oct 19, 2024

Did you get better results for drug names?

jaggzh · 2023-04-28T19:35:21Z

jaggzh
Apr 28, 2023

Might I ask for scripts/instructions on fine tuning?
My wife is breathing on a ventilator and her voice is not recognized by any software I've used. I've done some NNs of my own with minor success.
I need to know the training data format and setup, and the method/scripts to train the model(s).
I'm having trouble finding any references on how to do this.
Thanks so much.

4 replies

DavraYoung Apr 28, 2023

@jaggzh check this tutorial: https://huggingface.co/blog/fine-tune-whisper

For training data you need a labeled dataset of audio files and the texts that those audio files represent.

sanchit-gandhi May 5, 2023

Useful guide for getting a custom audio dataset into HF datasets format: https://huggingface.co/docs/datasets/audio_dataset
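As a concrete sketch of that format: the "audiofolder" loader reads a metadata.csv with a file_name column placed next to the audio files. The helper names and the "transcription" column name below are my own choices, and nothing in this flow uploads the recordings anywhere.

```python
import csv
import tempfile
from pathlib import Path


def write_metadata(audio_dir, transcripts):
    """Write metadata.csv (file_name, transcription) next to the audio files."""
    out = Path(audio_dir) / "metadata.csv"
    with out.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["file_name", "transcription"])
        for file_name, text in sorted(transcripts.items()):
            writer.writerow([file_name, text])
    return out


def load_local_dataset(audio_dir):
    """Load the folder with the datasets library; everything stays on the local disk."""
    from datasets import Audio, load_dataset  # imported lazily; needs `datasets`

    ds = load_dataset("audiofolder", data_dir=str(audio_dir))
    return ds.cast_column("audio", Audio(sampling_rate=16_000))


# Demo in a temporary folder: only the CSV is written, no audio is needed for this step.
with tempfile.TemporaryDirectory() as tmp:
    path = write_metadata(tmp, {"clip_002.wav": "second clip", "clip_001.wav": "first clip"})
    rows = path.read_text(encoding="utf-8").splitlines()
print(rows)
```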

jaggzh Nov 6, 2023

Like I asked, I need to train it locally on my own system so as not to upload private voice recordings of someone else outside.

PankajBarai Aug 20, 2024

@jaggzh I wanted to know: have you fine-tuned the model locally?

monk1337 · 2023-07-27T08:20:41Z

monk1337
Jul 27, 2023

@sanchit-gandhi, @DavraYoung I am planning to collect custom data to fine-tune the Whisper model. What should I keep in mind while collecting the audio? Can you help with the setup, for example:

What should be the average length of an audio clip?
Should we use a single mic or phone, or several different mics? What is the ideal number?
What format, sample rate, etc. should we use?

Please let me know what else to keep in mind before collecting a large amount of data.

15 replies

monk1337 Oct 31, 2023

@DavraYoung I've recently created a dataset using text-to-speech APIs on custom documents. The dataset consists of 1,000 audio samples, with 700 designated for training and 300 for testing. In total, this equates to about 4 hours of audio, where each clip is approximately 30 seconds long.

I'm attempting to fine-tune the Whisper small model with the help of HuggingFace's script, following the tutorial they've provided Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers.

Before diving into the fine-tuning, I evaluated the WER on OpenAI's pre-trained model, which stood at WER = 23.078%.

However, as my fine-tuning progresses, I'm observing some unexpected behavior:

As visible, the Validation Loss and WER are both on the rise during the fine-tuning phase. I'm at a bit of a loss here. Why might this be happening? Any insights or recommendations would be greatly appreciated.

Thank you in advance!
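For reference when comparing numbers like these, WER is the word-level edit distance divided by the number of reference words. A dependency-free sketch (the blog post's notebook uses the evaluate library instead, and this version does no text normalization):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


print(wer("the cat sat on the mat", "the cat sit on mat"))
```

Differences in casing and punctuation count as errors here, so it is worth applying the same normalization to the baseline and the fine-tuned model before comparing them.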

DavraYoung Nov 1, 2023

Check what is being fed to the model.
Specifically:

tokens: make sure the tokens are correct
make sure you use the appropriate task-related tokens
check that your audio has the correct sample rate before passing it to the feature extractor. It must be 16000 Hz
if you are using pydub and mp3, make sure to pass a higher bitrate to the export function (by default it's set to 64k, which in my case lost detail)

If everything is fine, reduce your learning rate (too high an lr can result in increasing WER).

Your dataset is too small; consider augmentation and dropout techniques to improve the situation.

Do not train for too many epochs. In my case I use no more than 4-5 epochs; otherwise the model overfits and the training accuracy diverges from validation.
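The sample-rate check can be sketched with the standard library alone, assuming the clips are PCM WAV files (other formats need a different reader). The function names are illustrative.

```python
import os
import tempfile
import wave

TARGET_RATE = 16_000  # Hz, the rate Whisper's feature extractor expects


def wav_sample_rate(path):
    """Read the sample rate from a PCM WAV header."""
    with wave.open(str(path), "rb") as w:
        return w.getframerate()


def needs_resampling(path, target=TARGET_RATE):
    """True if the file must be resampled before feature extraction."""
    return wav_sample_rate(path) != target


# Demo: write one second of 8 kHz silence (a typical telephone rate) and check it.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "telephone.wav")
    with wave.open(demo, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(8_000)
        w.writeframes(b"\x00\x00" * 8_000)
    demo_rate = wav_sample_rate(demo)
    demo_flag = needs_resampling(demo)
print(demo_rate, demo_flag)
```

With the datasets library, casting the audio column to Audio(sampling_rate=16000), as the fine-tuning blog post does, resamples on the fly, so a check like this mainly helps when audio is loaded by hand.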

DavraYoung Nov 1, 2023

@monk1337

monk1337 Dec 25, 2023

I have custom text data for plant disease names and plant names like this:

uuid, context 
1er1hhaj13, The Rhododendron, a popular ornamental plant, often suffers from Phytophthora ramorum, a challenging disease to manage and pronounce. This pathogen causes Sudden Oak Death, which can lead to extensive damage and mortality in infected plants.

I used text-to-speech APIs to convert this context into audio WAV files, choosing 10 speakers with mostly American and British accents. So I created around ~5k samples for training and ~2k samples for testing.

I followed the same steps from "Fast whisper finetuning" to finetune the peft version of Whisper Large-v2. The training and validation loss looks good:

Step | Training Loss | Validation Loss
250 | 0.413000 | 0.102663
500 | 0.109900 | 0.130888
750 | 0.116500 | 0.102719
1000 | 0.092800 | 0.099153
1250 | 0.068800 | 0.075613 
1500 | 0.042500 | 0.085680
1750 | 0.047500 | 0.076951
2000 | 0.027500 | 0.065127
2250 | 0.023700 | 0.061832
2500 | 0.012500 | 0.062658
2750 | 0.011500 | 0.061922
3000 | 0.008500 | 0.061463
3250 | 0.005300 | 0.060227
3500 | 0.003800 | 0.060712
3750 | 0.002700 | 0.060332
4000 | 0.002300 | 0.060496

When I calculated WER on the test data:

OpenAI Whisper APIs: 22.03 WER on test data
Finetuned model: 0.3 WER on test data

Which looks good. However, during real-time testing with an Indian English-speaking audience, the accuracy for plant names and disease names was not satisfactory. What strategies could we employ to improve accuracy in real-time settings?
Any guidance or suggestions on this matter would be greatly appreciated. Thank you!

@DavraYoung

asr-lord Sep 20, 2024

@monk1337 hi. Our experiments have shown that model accuracy increases when we train it with context tokens (Whisper's startofprev token + context tokens): given the same number of audio hours, the model sees more text and learns more about the language.

Regarding silence finetuning, this seems to be a good approach for fixing silence hallucination issues

Hi @monk1337, do you have any sample code for training with context tokens? I'm interested. Thank you
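A minimal sketch of how such a training example could be assembled, working on plain token id lists. It assumes the layout startofprev + context tokens + start-of-transcript sequence + text + end-of-text, with the loss masked over the context as the Whisper paper describes. The ids in the demo are made up; real ones would come from the tokenizer, e.g. via convert_tokens_to_ids("<|startofprev|>").

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def build_example(prev_ids, text_ids, sot_prev, sot_sequence, eot):
    """Return (decoder_input_ids, labels) with the previous-context tokens masked in labels."""
    context = [sot_prev] + list(prev_ids)
    tokens = context + list(sot_sequence) + list(text_ids) + [eot]
    masked = [IGNORE_INDEX] * len(context) + tokens[len(context):]
    # Teacher forcing: the input at position t predicts the label at position t.
    decoder_input_ids = tokens[:-1]
    labels = masked[1:]
    return decoder_input_ids, labels


# Made-up ids: 1 = startofprev, [2, 3, 4, 5] = start-of-transcript sequence, 9 = end of text.
inputs, labels = build_example([11, 12], [21, 22], sot_prev=1, sot_sequence=[2, 3, 4, 5], eot=9)
print(inputs)
print(labels)
```

With the Hugging Face model, decoder_input_ids would need to be passed explicitly alongside labels, because the default shift that derives the inputs from the labels replaces the masked positions with padding and would wipe out the context.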

rampedro · 2023-08-12T07:01:35Z

rampedro
Aug 12, 2023

Hello all,

I'm currently attempting to employ my own customized dataset (accessible here: https://huggingface.co/datasets/pedramaa/arabic-llm-egyption) for the purpose of fine-tuning in the realm of whisper transcription tasks.

Having encountered significant challenges, primarily involving the selection of an appropriate environment and the formatting of my data to align with the structures found in common_voice datasets, I have finally managed to configure my local notebook with GPU support. However, I've hit a roadblock at the final step, specifically within the line containing trainer.train(). This is where I'm encountering a variety of versioning and accelerator-related errors. On occasion, I also come across an error indicating that the feature_extractor is undefined, despite my attempts to import it again within the prepare_dataset method.

Despite my efforts to follow the guidance provided in the article at https://huggingface.co/blog/fine-tune-whisper, I'm encountering difficulties when it comes to executing a custom fine-tuning process. In my most recent attempt, I encountered an error within Google Colab that seemed to require either updating accelerate or installing transformers[torch].

If any of you have insights or ideas to offer that could potentially assist me in overcoming these challenges, I would greatly appreciate your input. Thank you

3 replies

gaganmanku96 Aug 14, 2023

@rampedro I faced a similar issue.
Here is what you can try:

!pip install -U accelerate
and then restart runtime.

You don't need to install the packages after that in that runtime session.

sanchit-gandhi Sep 6, 2023

This guide should help in getting your dataset into the right format: https://huggingface.co/docs/datasets/audio_dataset

I would recommend that you follow this guide, and then upload your dataset to the Hub with .push_to_hub.

Once you execute this, it will be in a format very similar to the Common Voice dataset. You will be able to load the dataset from the Hub by specifying the repo id where you've saved it, and should then be able to run the fine-tuning script without any further changes.

johnatanebonilla Dec 27, 2023

@rampedro @sanchit-gandhi Hi, I'm having this problem with accelerate, even after installing :( any idea? thanks in advance

ImportError Traceback (most recent call last)
in <cell line: 3>()
1 from transformers import Seq2SeqTrainingArguments
2
----> 3 training_args = Seq2SeqTrainingArguments(
4 output_dir="./whisper-small-canario", # change to a repo name of your choice
5 per_device_train_batch_size=16,

4 frames
/usr/local/lib/python3.10/dist-packages/transformers/training_args.py in _setup_devices(self)
1785 if not is_sagemaker_mp_enabled():
1786 if not is_accelerate_available(min_version="0.20.1"):
-> 1787 raise ImportError(
1788 "Using the Trainer with PyTorch requires accelerate>=0.20.1: Please run pip install transformers[torch] or pip install accelerate -U"
1789 )

ImportError: Using the Trainer with PyTorch requires accelerate>=0.20.1: Please run pip install transformers[torch] or pip install accelerate -U
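A quick way to see which accelerate version the running interpreter actually finds is sketched below, using only the standard library. The minimum is the one named in the error message; the helper names are made up. If the version on disk looks fine but the error persists, the notebook session was probably started before the upgrade, which is why restarting the runtime matters.

```python
import itertools
from importlib import metadata

MIN_ACCELERATE = "0.20.1"  # minimum quoted in the ImportError


def version_tuple(version):
    """Turn '0.20.1' into (0, 20, 1); suffixes such as 'rc1' or '.dev0' are ignored."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = "".join(itertools.takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def accelerate_ok(minimum=MIN_ACCELERATE):
    """True if accelerate is installed and at least `minimum`."""
    try:
        installed = metadata.version("accelerate")
    except metadata.PackageNotFoundError:
        return False
    return version_tuple(installed) >= version_tuple(minimum)


print("accelerate OK:", accelerate_ok())
```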


jaggzh · 2023-08-17T06:45:49Z

jaggzh
Aug 17, 2023

My issue is that I'm trying to fine-tune for someone who breathes on a ventilator. Their utterances are short phrases of several words maximum. Is Whisper at all going to be suitable for them?


3 replies

DavraYoung Aug 18, 2023

There is no clear answer to that. You should definitely try fine-tuning and see if Whisper gets better.

glangford Aug 20, 2023

@jaggzh fyi in case this is helpful.

CloudVent: speech recognition for people with paralysis using ventilators

https://www.hra.nhs.uk/planning-and-improving-research/application-summaries/research-summaries/cloudvent-v10/

https://catch.sites.sheffield.ac.uk/projects/cloudvent

jaggzh Oct 31, 2023

By the way, it turns out their project was old and not continued (or something like that, they told me). But thanks for the lead, it was worth a shot.

mathiasfoster · 2023-08-31T03:08:40Z

mathiasfoster
Aug 31, 2023

@sanchit-gandhi I have been having real trouble following https://huggingface.co/blog/fine-tune-whisper with my own dataset.

See my dataset here: https://huggingface.co/datasets/MathiasFoster/whisper-data
No matter what I try, I run into some error when trying to load this dataset, and fudge the associated notebook around to make it fit.
How would you finetune Whisper on this dataset?
(I don't need a test dataset at this point, I just want to get the model to the point of training it!)

7 replies

mathiasfoster Sep 6, 2023

I can see it would help, @sanchit-gandhi, but I struggled with creating the dataset.
Following the local files method, where do we add in the transcriptions data?
What format should my files be in, and how do I load them?

sanchit-gandhi Sep 7, 2023

For the transcription data, if you have a list of your transcriptions, you can use .add_column to append them to your audio dataset: https://discuss.huggingface.co/t/how-to-add-a-new-column-to-a-dataset/6453/2

Regarding the format of your audio files, they can be in any standard audio format (.wav/.mp3/.flac) - datasets will use the soundfile package to read your data, meaning if it is compatible with soundfile, it can be loaded as an audio dataset: https://pysoundfile.readthedocs.io/en/latest/

You don't need to worry about loading them yourself, just specify the paths when you instantiate the dataset
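Put together, this could look like the sketch below. It assumes the transcriptions are given in the same order as the sorted file names, which is my assumption and worth double-checking on real data; the helper names and the "transcription" column name are illustrative.

```python
import tempfile
from pathlib import Path

AUDIO_SUFFIXES = {".wav", ".mp3", ".flac"}


def list_audio_files(directory):
    """Sorted paths of the audio files directly inside `directory`."""
    return sorted(str(p) for p in Path(directory).iterdir()
                  if p.suffix.lower() in AUDIO_SUFFIXES)


def build_dataset(directory, transcriptions):
    """Create an audio dataset from paths and append the transcriptions as a column."""
    from datasets import Audio, Dataset  # imported lazily; needs `datasets`

    paths = list_audio_files(directory)
    if len(paths) != len(transcriptions):
        raise ValueError("need exactly one transcription per audio file")
    ds = Dataset.from_dict({"audio": paths}).cast_column("audio", Audio(sampling_rate=16_000))
    return ds.add_column("transcription", list(transcriptions))


# Demo of the file listing only, using empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.wav", "a.mp3", "notes.txt"):
        (Path(tmp) / name).touch()
    names = [Path(p).name for p in list_audio_files(tmp)]
print(names)
```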

mathiasfoster Sep 7, 2023

Thank you for helping me and putting up with my slowness.
Is there any way to specify a directory, instead of specifying each file individually?
This will greatly speed it up – and I will let you know how it goes!

sanchit-gandhi Sep 27, 2023

Note that you just need to specify the paths to the audio files, not the loaded audio files themselves. Loading the audio files will be equally fast whether you specify the paths to the audio files, or the directory where they are saved. In both circumstances, you need to load the same number of audio files, so the runtime will be the same.

mathiasfoster Nov 2, 2023

Hey @sanchit-gandhi,
I have a dataset of audio files less than 30 seconds long, in Parquet format... created by:

from datasets import load_dataset
dataset = load_dataset("audiofolder", data_dir=r"..\Downloads\call-recordings-v4\WAV", drop_labels=True)  # raw string so the backslashes are not read as escapes

dataset.push_to_hub(repo_id="<username>/<repo>", token="<token>", private=True)

I want to stream this into a Colab, create transcriptions for these using my fine-tuned Whisper model, and save these transcriptions to the dataset.
The plan is then to download the dataset, correct the transcriptions manually, and then further finetune the Whisper model on the corrected transcripts.
I'm not sure how to go about this - could you help here?
Step one is creating transcriptions and saving it back to the dataset! Or is there a better way of doing this?
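One possible shape for step one is a function passed to the dataset's map method that adds a "transcription" column. In the sketch below the ASR model is any callable returning a dict with a "text" field, so a Transformers pipeline should fit; the stand-in fake_asr only shows the data flow. The layout of the decoded audio entry ("array", "sampling_rate") is assumed and may differ between datasets versions.

```python
def transcribe_batch(batch, asr):
    """Function for Dataset.map(batched=True): returns a 'transcription' column."""
    texts = []
    for audio in batch["audio"]:
        # Build a fresh dict per clip so the dataset's own audio entry is not modified.
        sample = {"raw": audio["array"], "sampling_rate": audio["sampling_rate"]}
        texts.append(asr(sample)["text"].strip())
    return {"transcription": texts}


def fake_asr(sample):
    """Stand-in for a transformers ASR pipeline, used only to show the data flow."""
    return {"text": f" {len(sample['raw'])} samples at {sample['sampling_rate']} Hz"}


demo_batch = {"audio": [{"array": [0.0] * 4, "sampling_rate": 16_000}]}
demo_out = transcribe_batch(demo_batch, fake_asr)
print(demo_out)

# With the real objects (repo id, token and `pipe` are placeholders):
#   ds = load_dataset("<username>/<repo>", split="train", token="<token>")
#   ds = ds.map(transcribe_batch, batched=True, batch_size=8, fn_kwargs={"asr": pipe})
#   ds.push_to_hub("<username>/<repo>", private=True)
```

This loads the dataset normally rather than streaming it, which keeps the push back to the Hub simple; for clips under 30 seconds the download is likely small enough for that to be fine.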

ILG2021 · 2023-09-26T21:09:05Z

ILG2021
Sep 26, 2023

Hello everyone, I want to know whether I can use audio that mixes multiple languages for fine-tuning, for example a recording in which someone speaks partly in English and partly in Chinese. Can such data be used, or should I avoid it? I suspect it is hard for the tokenizer to handle this kind of text.

0 replies

Ashutosh-4485 · 2023-09-28T14:35:40Z

Ashutosh-4485
Sep 28, 2023

Is there any way to use whisper for real-time speech recognition ?

1 reply

jaggzh Oct 30, 2023

You can try my https://github.com/jaggzh/whisperpluck project. I coded it for someone who has problems using the keyboard. At present it's designed to use a ui with buttons to trigger recording, transcription, and copy it to the clipboard (it doesn't insert it anywhere automatically).
There's a variable in the whisper-auto script for choosing whether it will run whisper each time it processes it, or use the included server that keeps a model loaded.
whisper-auto can also be used from external scripts, but it needs some way to terminate the recording (which is currently done with just a run of 'arecord', although it'd be nice to modify that and use sox's rec with silence detection).

Additionally, I wrote https://github.com/jaggzh/kbinsert which can insert text as if typed on the keyboard (linux only). It builds kbinsert which works in the terminal, and kbinsertx (use with kbinsertx -g) to "type" it wherever you are in X11. This is not incorporated into the whisperpluck project right now though.

johnatanebonilla · 2023-12-27T03:27:47Z

johnatanebonilla
Dec 27, 2023

Hi @sanchit-gandhi, @DavraYoung,

I'm currently working on fine-tuning the Whisper model to transcribe Spanish rural dialects, focusing on phonological aspects like elision and concatenation. My goal is to preserve the spoken disfluencies for linguistic analysis. I've been using @sanchit-gandhi's Colab notebook for fine-tuning but encountered an issue during the training parameter setup.

Error Encountered:
While defining the training parameters, I received an ImportError despite installing the necessary libraries. Here's the error message:

ImportError: Using the Trainer with PyTorch requires accelerate>=0.20.1: Please run pip install transformers[torch] or pip install accelerate -U
I've already tried installing the mentioned libraries, but the issue persists.

Could you provide any insights on resolving this ImportError?
Additionally, how can I achieve both phonological and orthographic transcriptions, given that the Spanish model is already trained?
Any help or guidance would be greatly appreciated!

0 replies

afsara-ben · 2024-05-21T22:06:40Z

afsara-ben
May 21, 2024

any way to finetune the original openai-whisper model without using huggingface abstractions?

1 reply

pplantinga Jul 15, 2024

You can train with SpeechBrain, no huggingface abstractions needed: https://github.com/speechbrain/speechbrain/tree/develop/recipes/LibriSpeech/ASR/transformer

hkstemcenter · 2024-06-05T17:10:17Z

hkstemcenter
Jun 5, 2024

https://mobiusml.github.io/whisper-static-cache-blog/

0 replies

jaggzh · 2024-10-12T01:38:13Z

jaggzh
Oct 12, 2024

I'm fine-tuning for a patient whose voice is whispery and who breathes on a ventilator. That is, it's airy and like white noise. I'm wondering if I should include the natural noises in the training (like the patient coughing, murmuring, etc.) so the model can learn that this is not voice?
That is, at present their voice is very different from most people's, and sounds like a weak cough will trigger the fine-tuned model to output words.
I have two approaches: labeling the coughs, like '<|cough|>', or labeling only the text in those areas while letting the noise remain part of the recording.

0 replies

toanhuynhnguyen · 2024-10-19T14:01:57Z

toanhuynhnguyen
Oct 19, 2024

Based on this guide https://huggingface.co/blog/fine-tune-whisper, I tried to fine-tune "small" and "large-v3" models:

The fine-tuned small model works normally, it can transcribe English, Malay, and Chinese.
But the fine-tuned large-v3 model works poorly on Malay and Chinese. For example, given a Chinese audio file, it does not transcribe the Chinese but translates it into English, even though I specified transcription in Chinese. Has anyone faced this issue and can give me some advice? Thank you so much.
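One thing that may be worth checking is that language and task are forced explicitly at inference time. A sketch with the Transformers pipeline, where the options go through generate_kwargs; the helper names are made up and the checkpoint is a placeholder for the fine-tuned model.

```python
def whisper_generate_kwargs(language, task="transcribe"):
    """Generation options that pin Whisper to one language and task."""
    if task not in {"transcribe", "translate"}:
        raise ValueError("task must be 'transcribe' or 'translate'")
    return {"language": language, "task": task}


def transcribe_in_language(path, language, model_id="openai/whisper-large-v3"):
    """Transcribe without translating (needs transformers, torch and ffmpeg)."""
    from transformers import pipeline  # imported lazily; heavy dependency

    pipe = pipeline("automatic-speech-recognition", model=model_id)
    return pipe(path, generate_kwargs=whisper_generate_kwargs(language))["text"]


print(whisper_generate_kwargs("chinese"))
```

If the output is still English with these set, the training labels themselves could be the cause: if every example was prefixed with the same language token during fine-tuning (the tokenizer's set_prefix_tokens controls this), the model may have learned to ignore it. That is a guess, not something confirmed here.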

0 replies


Replies: 17 comments · 50 replies
