Whisper

Available models and languages

There are six model sizes, four with English-only versions, offering tradeoffs between speed and accuracy. Below are the names of the available models, along with their approximate memory requirements and inference speed relative to the large model. The relative speeds below were measured by transcribing English speech on an A100; real-world speed may vary significantly depending on many factors, including the language, the speaking speed, and the available hardware.

Size	Parameters	English-only model	Multilingual model	Required VRAM	Relative speed
tiny	39 M	`tiny.en`	`tiny`	~1 GB	~10x
base	74 M	`base.en`	`base`	~1 GB	~7x
small	244 M	`small.en`	`small`	~2 GB	~4x
medium	769 M	`medium.en`	`medium`	~5 GB	~2x
large	1550 M	N/A	`large`	~10 GB	1x
turbo	809 M	N/A	`turbo`	~6 GB	~8x

The .en models tend to perform better for English-only applications, especially tiny.en and base.en. We observed that the difference becomes less significant for the small.en and medium.en models. Additionally, the turbo model is an optimized version of large-v3 that offers faster transcription speed with minimal degradation in accuracy.
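The table above can be turned into a small lookup for choosing a model. The following is a minimal sketch and not part of the whisper package: `MODELS`, `HAS_ENGLISH_ONLY`, and `pick_model` are hypothetical names, the figures are the approximate values from the table, and parameter count is used as a rough proxy for accuracy (an assumption, not a measured ranking).

```python
# Approximate figures copied from the table above:
# (model name, parameters in millions, required VRAM in GB).
MODELS = [
    ("tiny", 39, 1),
    ("base", 74, 1),
    ("small", 244, 2),
    ("medium", 769, 5),
    ("large", 1550, 10),
    ("turbo", 809, 6),
]

# Sizes that also have an English-only ".en" variant.
HAS_ENGLISH_ONLY = {"tiny", "base", "small", "medium"}


def pick_model(vram_gb, english_only=False):
    """Return the largest model (by parameter count) that fits in vram_gb.

    Hypothetical helper: parameter count stands in for accuracy here.
    """
    candidates = [m for m in MODELS if m[2] <= vram_gb]
    if english_only:
        candidates = [m for m in candidates if m[0] in HAS_ENGLISH_ONLY]
    if not candidates:
        raise ValueError(f"no model fits in {vram_gb} GB of VRAM")
    name = max(candidates, key=lambda m: m[1])[0]
    return name + ".en" if english_only else name
```

With these figures, `pick_model(6)` returns `"turbo"` and `pick_model(6, english_only=True)` returns `"medium.en"`; the returned name can be passed to `whisper.load_model()`.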

Whisper's performance varies widely depending on the language. The figure below shows a performance breakdown of the large-v3 and large-v2 models by language, using WERs (word error rates) or CERs (character error rates, shown in italics) evaluated on the Common Voice 15 and Fleurs datasets. Additional WER/CER metrics for the other models and datasets can be found in Appendices D.1, D.2, and D.4 of the paper, and BLEU (Bilingual Evaluation Understudy) scores for translation can be found in Appendix D.3.
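For readers unfamiliar with these metrics, the sketch below computes WER and CER using the standard definition: the Levenshtein (edit) distance between hypothesis and reference, divided by the reference length. It is an illustration only. The function names are hypothetical, and it omits the text normalization that published evaluations typically apply before scoring, so its numbers are not directly comparable to those in the figure.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

For example, `wer("hello world", "hello there world")` is 0.5: one inserted word against a two-word reference.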

Usage

Transcription can also be performed within Python:

import whisper

model = whisper.load_model("tiny")
result = model.transcribe("audio.mp3")

Internally, the transcribe() method reads the entire file and processes the audio with a sliding 30-second window, performing autoregressive sequence-to-sequence predictions on each window.
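To make the windowing concrete, the sketch below computes the sample ranges of consecutive 30-second windows for audio sampled at 16 kHz, the rate that `whisper.load_audio()` resamples to by default. This is a simplification under stated assumptions: `window_bounds` is a hypothetical helper, and it advances by a fixed 30 seconds each time, whereas the actual `transcribe()` implementation may advance the window by a different amount (for example, based on predicted timestamps).

```python
SAMPLE_RATE = 16_000  # Hz; assumed to match whisper.load_audio's default
WINDOW_SECONDS = 30   # window length described above


def window_bounds(n_samples, sample_rate=SAMPLE_RATE, window_seconds=WINDOW_SECONDS):
    """Return (start, end) sample indices of consecutive fixed-length windows.

    The final window may be shorter than the rest and would need padding
    (e.g. with whisper.pad_or_trim) before being fed to the model.
    """
    window = sample_rate * window_seconds
    return [
        (start, min(start + window, n_samples))
        for start in range(0, n_samples, window)
    ]


# A 70-second clip splits into two full windows and one 10-second remainder.
for start, end in window_bounds(70 * SAMPLE_RATE):
    print(f"{start / SAMPLE_RATE:.0f}s to {end / SAMPLE_RATE:.0f}s")
```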

Below is an example of using whisper.detect_language() and whisper.decode(), which provide lower-level access to the model.

import torch
import whisper

model = whisper.load_model("turbo")

# load audio and pad/trim it to fit 30 seconds
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# make log-Mel spectrogram and move to the same device as the model
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

# detect the spoken language
with torch.no_grad():
    lang_tokens, lang_probs, lang_logits = model.detect_language(mel)

print(f"Detected language: {max(lang_probs, key=lang_probs.get)}")

# decode the audio
with torch.no_grad():
    options = whisper.DecodingOptions()
    result = whisper.decode(model, mel, options)

# print the recognized text, then the token IDs and the token and language logits
print(result.text)
print(result.tokens)
print(result.token_logits)
print(result.language_logits)
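The detected language is only the top entry of the probability distribution, and it can be useful to inspect the runners-up as well. The sketch below assumes that `lang_probs` from the example above is a dictionary mapping language codes to probabilities (which is what the `max(..., key=....get)` idiom implies); `top_languages` is a hypothetical helper, and the probabilities shown are made up for illustration.

```python
def top_languages(probs, k=3):
    """Return the k most probable (language, probability) pairs, highest first."""
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]


# Made-up values for illustration only; real values would come from
# model.detect_language(mel).
example_probs = {"en": 0.91, "de": 0.04, "nl": 0.03, "fr": 0.02}

for lang, p in top_languages(example_probs, k=2):
    print(f"{lang}: {p:.2f}")
```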

Teacher-forced inference, in which the decoder is fed a known token sequence instead of its own predictions, can be done as follows.

import torch
from whisper.tokenizer import get_tokenizer

# reuses `whisper`, `model`, and `mel` from the example above

tokenizer = get_tokenizer(
    model.is_multilingual, num_languages=model.num_languages
)
tf_tokens = torch.tensor(tokenizer.encode("tomas is the best, said marian."))
# tf_tokens.shape --> (10,)

options = whisper.DecodingOptions(sample_len=1)
result = whisper.decode(model, mel, tf_tokens=tf_tokens, options=options)
# result.token_logits.shape --> (14,): 3 leading special tokens + 10 text tokens + 1 end-of-transcript token
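To illustrate what teacher forcing means independently of Whisper, the toy sketch below contrasts it with ordinary free-running decoding. Everything here is hypothetical: the "model" is a hand-written lookup table that predicts the next token from the previous one, and none of the names correspond to the whisper API.

```python
# Toy "model": predicts the next token from the previous token only.
# A hypothetical stand-in that has nothing to do with Whisper's decoder.
BIGRAM = {
    "<sot>": "the",
    "the": "cat",
    "cat": "sat",
    "dog": "ran",
    "sat": "<eot>",
    "ran": "<eot>",
}


def predict_next(prefix):
    """Return the toy model's prediction for the token following prefix."""
    return BIGRAM.get(prefix[-1], "<eot>")


def free_running(max_len=10):
    """Feed the model its own predictions until it emits <eot>."""
    seq = ["<sot>"]
    while len(seq) < max_len:
        nxt = predict_next(seq)
        seq.append(nxt)
        if nxt == "<eot>":
            break
    return seq[1:]


def teacher_forced(reference):
    """Condition every step on the reference prefix and record the prediction."""
    inputs = ["<sot>"] + reference
    return [predict_next(inputs[: i + 1]) for i in range(len(inputs))]


print(free_running())                        # the model's own transcript
print(teacher_forced(["the", "dog", "ran"]))  # one prediction per forced position
```

In the teacher-forced call the toy model predicts "cat" after "the", but the next step is still conditioned on the reference token "dog". The output has one more entry than the reference because a prediction is also recorded after the final reference token.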

License

Whisper's code and model weights are released under the MIT License. See LICENSE for further details.
