
Commit 58e8f3a

Support various Whisper models with Metal backend (#113)
* fix: rm model_name, executorch #15798
* feat: add model_name param to support various whisper models
* feat: support standard model's FEATURE_SIZE with 80
* docs: update model_name requirement, different FEATURE_SIZE and various model support
1 parent e6de006 commit 58e8f3a

File tree

- metal/whisper/README.md
- metal/whisper/e2e.sh
- metal/whisper/export.sh
- metal/whisper/run.sh

4 files changed: +124 −16 lines


metal/whisper/README.md

Lines changed: 91 additions & 10 deletions
@@ -21,6 +21,7 @@ The Metal backend scripts are located under `metal/`
 ## Whisper Example
 
 The Whisper example demonstrates how to:
+
 1. Set up your environment
 2. Export the Whisper model to ExecuTorch format
 3. Build the Whisper Metal runner
@@ -37,7 +38,8 @@ metal/whisper/e2e.sh \
   --setup-env \
   --export \
   --build \
-  --run --audio-path <audio_path>
+  --run --audio-path <audio_path> \
+  [--model-name <model_name>]
 ```
 
 **Required Arguments:**
@@ -46,7 +48,13 @@ metal/whisper/e2e.sh \
 - `<env_name>` - Name of the conda environment to create (e.g., `whisper-example`)
 - `<audio_path>` - Path to your audio file for inference (e.g., `~/Desktop/audio.wav`)
 
-**Example:**
+**Optional Arguments:**
+
+- `<model_name>` - HuggingFace Whisper model name (default: `openai/whisper-large-v3-turbo`)
+  - Available models: `openai/whisper-tiny`, `openai/whisper-base`, `openai/whisper-small`,
+    `openai/whisper-medium`, `openai/whisper-large`, `openai/whisper-large-v3-turbo`
+
+**Example (default large-v3-turbo model):**
 
 ```bash
 metal/whisper/e2e.sh \
@@ -60,11 +68,29 @@ metal/whisper/e2e.sh \
   --run --audio-path ~/Desktop/audio.wav
 ```
 
+**Example (using small model):**
+
+```bash
+metal/whisper/e2e.sh \
+  --artifact-dir ~/Desktop/whisper \
+  --env-name whisper-example \
+  --clone-et \
+  --create-env \
+  --setup-env \
+  --export \
+  --model-name openai/whisper-small \
+  --build \
+  --run --audio-path ~/Desktop/audio.wav
+```
+
 This will automatically:
+
 1. Setup the environment:
-- Clone the executorch repo
-- Create a conda environment named `whisper-example`
-- Install all dependencies
+
+   - Clone the executorch repo
+   - Create a conda environment named `whisper-example`
+   - Install all dependencies
+
 2. Export the Whisper model to the `~/Desktop/whisper` directory
 3. Build the Whisper Metal runner
 4. Run inference on `~/Desktop/whisper/audio.wav`
@@ -86,14 +112,29 @@ conda activate <env_name>
 #### Step 2: Export the Model
 
 ```bash
-/path/to/metal/whisper/export.sh <artifact_dir>
+/path/to/metal/whisper/export.sh <artifact_dir> [model_name]
 ```
 
 **Arguments:**
-- `<artifact_dir>` - Directory to store exported model files (e.g., `~/Desktop/whisper`)
+
+- `<artifact_dir>` - Directory to store exported model files (required, e.g., `~/Desktop/whisper`)
+- `[model_name]` - HuggingFace Whisper model name (optional, default: `openai/whisper-large-v3-turbo`)
+  - Available models: `openai/whisper-tiny`, `openai/whisper-base`, `openai/whisper-small`,
+    `openai/whisper-medium`, `openai/whisper-large`, `openai/whisper-large-v3-turbo`
+
+**Examples:**
+
+```bash
+# Export default large-v3-turbo model
+/path/to/metal/whisper/export.sh ~/Desktop/whisper
+
+# Export small model
+/path/to/metal/whisper/export.sh ~/Desktop/whisper openai/whisper-small
+```
 
 This will:
-- Download the Whisper model
+
+- Download the specified Whisper model from HuggingFace
 - Export it to ExecuTorch format with Metal optimizations
 - Save model files (`.pte`), metadata, and preprocessor to the specified directory
 
@@ -110,11 +151,51 @@ This will:
 ```
 
 **Arguments:**
-- `<audio_path>` - Path to your audio file (e.g., `/path/to/audio.wav`)
-- `<artifact_dir>` - Directory containing exported model files (e.g., `~/Desktop/whisper`)
+
+- `<audio_path>` - Path to your audio file (required, e.g., `/path/to/audio.wav`)
+- `<artifact_dir>` - Directory containing exported model files (required, e.g., `~/Desktop/whisper`)
+
+**Example:**
+
+```bash
+/path/to/metal/whisper/run.sh ~/Desktop/audio.wav ~/Desktop/whisper
+```
 
 This will:
+
 - Validate that all required model files exist
 - Load the model and preprocessor
 - Run inference on the provided audio
 - Display timing information
+
+## Available Whisper Models
+
+The following Whisper models are supported:
+
+| Model Name     | HuggingFace ID (for export)     | Parameters | Mel Features | Relative Speed | Use Case                       |
+| -------------- | ------------------------------- | ---------- | ------------ | -------------- | ------------------------------ |
+| Tiny           | `openai/whisper-tiny`           | 39M        | 80           | Fastest        | Quick transcription, real-time |
+| Base           | `openai/whisper-base`           | 74M        | 80           | Very Fast      | Good balance for real-time     |
+| Small          | `openai/whisper-small`          | 244M       | 80           | Fast           | Recommended for most use cases |
+| Medium         | `openai/whisper-medium`         | 769M       | 80           | Moderate       | Higher accuracy needed         |
+| Large          | `openai/whisper-large`          | 1550M      | 80           | Slower         | Best accuracy                  |
+| Large V3       | `openai/whisper-large-v3`       | 1550M      | **128**      | Slower         | Latest architecture            |
+| Large V3 Turbo | `openai/whisper-large-v3-turbo` | 809M       | **128**      | Fast           | Default, good balance          |
+
+### Mel Features Configuration
+
+The export scripts automatically configure the correct mel feature size based on the model:
+
+- **80 mel features**: Used by all standard models (tiny, base, small, medium, large, large-v2)
+- **128 mel features**: Used only by large-v3 and large-v3-turbo variants
+
+**Important:** The preprocessor must match the model's expected feature size, or you'll encounter tensor shape mismatch errors. The export scripts handle this automatically.
+
+### Tokenizer Configuration
+
+**Important Note:** All Whisper models downloaded from HuggingFace now use the updated tokenizer format where:
+
+- Token `50257` = `<|endoftext|>`
+- Token `50258` = `<|startoftranscript|>` (used as `decoder_start_token_id`)
+
+The whisper_runner automatically uses `decoder_start_token_id=50258` for all models, so you don't need to worry about tokenizer compatibility when exporting and running any Whisper variant.
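The new "Mel Features Configuration" and "Tokenizer Configuration" sections above describe values that come straight from each model's HuggingFace `config.json`. As a quick cross-check before exporting (a sketch, not part of this commit; it assumes `curl` and `python3` are available and that the model repo publishes `num_mel_bins` and `decoder_start_token_id` in `config.json`, as the `openai/whisper-*` repos do):

```bash
# Hypothetical pre-export check: read the mel feature size and decoder start
# token directly from the model's config.json on HuggingFace.
MODEL_NAME="openai/whisper-small"
curl -sL "https://huggingface.co/$MODEL_NAME/resolve/main/config.json" \
  | python3 -c 'import json,sys; c=json.load(sys.stdin); print("num_mel_bins:", c.get("num_mel_bins"), "| decoder_start_token_id:", c.get("decoder_start_token_id"))'
```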

metal/whisper/e2e.sh

Lines changed: 9 additions & 1 deletion
@@ -18,6 +18,7 @@ EXECUTORCH_PATH=""
 ARTIFACT_DIR=""
 ENV_NAME=""
 AUDIO_PATH=""
+MODEL_NAME="openai/whisper-large-v3-turbo"
 
 echo "Current script path: $(realpath "$0")"
 SCRIPT_DIR="$(realpath "$(dirname "$(realpath "$0")")")"
@@ -37,13 +38,15 @@ usage() {
   echo "  --create-env          Create the Python environment"
   echo "  --setup-env           Set up the Python environment"
   echo "  --export              Export the Whisper model"
+  echo "  --model-name NAME     HuggingFace model name (optional, default: openai/whisper-large-v3-turbo)"
   echo "  --build               Build the Whisper runner"
   echo "  --audio-path PATH     Path to the input audio file"
   echo "  --run                 Run the Whisper model"
   echo "  -h, --help            Display this help message"
   echo ""
   echo "Example:"
   echo "  $0 --env-name metal-backend --setup-env --export --build --audio-path audio.wav --run"
+  echo "  $0 --env-name metal-backend --export --model-name openai/whisper-small --build --audio-path audio.wav --run"
   exit 1
 }
 
@@ -78,6 +81,10 @@ while [[ $# -gt 0 ]]; do
       EXPORT=true
       shift
       ;;
+    --model-name)
+      MODEL_NAME="$2"
+      shift 2
+      ;;
     --build)
       BUILD=true
       shift
@@ -160,8 +167,9 @@ fi
 # Execute export
 if [ "$EXPORT" = true ]; then
   echo "Exporting Whisper model to $ARTIFACT_DIR ..."
+  echo "  - Model: $MODEL_NAME"
   echo "  - Script: $SCRIPT_DIR/export.sh"
-  conda run -n "$ENV_NAME" "$SCRIPT_DIR/export.sh" "$ARTIFACT_DIR"
+  conda run -n "$ENV_NAME" "$SCRIPT_DIR/export.sh" "$ARTIFACT_DIR" "$MODEL_NAME"
 fi
 
 # Execute build
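With `--model-name` now threaded through `e2e.sh` into `export.sh`, a plausible follow-up run that swaps models without repeating environment setup could look like the sketch below (an assumption: the conda environment and executorch checkout already exist from a previous full run, so `--clone-et`, `--create-env`, and `--setup-env` are omitted):

```bash
# Hypothetical second run: reuse the existing environment, re-export a
# different model, rebuild the runner, and transcribe the same audio file.
metal/whisper/e2e.sh \
  --artifact-dir ~/Desktop/whisper \
  --env-name whisper-example \
  --export --model-name openai/whisper-base \
  --build \
  --run --audio-path ~/Desktop/audio.wav
```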

metal/whisper/export.sh

Lines changed: 24 additions & 4 deletions
@@ -6,29 +6,49 @@
 # LICENSE file in the root directory of this source tree.
 
 ARTIFACT_DIR="$1"
+MODEL_NAME="${2:-openai/whisper-large-v3-turbo}"
 
 if [ -z "$ARTIFACT_DIR" ]; then
   echo "Error: ARTIFACT_DIR argument not provided."
-  echo "Usage: $0 <ARTIFACT_DIR>"
+  echo "Usage: $0 <ARTIFACT_DIR> [MODEL_NAME]"
+  echo ""
+  echo "Arguments:"
+  echo "  <ARTIFACT_DIR>   Directory to store exported model files (required)"
+  echo "  [MODEL_NAME]     HuggingFace model name (optional, default: openai/whisper-large-v3-turbo)"
+  echo ""
+  echo "Example:"
+  echo "  $0 ~/Desktop/whisper openai/whisper-small"
   exit 1
 fi
 
 mkdir -p "$ARTIFACT_DIR"
 
+echo "Exporting model: $MODEL_NAME"
+
+# Determine feature_size based on model name
+# large-v3 and large-v3-turbo use 128 mel features, all others use 80
+if [[ "$MODEL_NAME" == *"large-v3"* ]]; then
+  FEATURE_SIZE=128
+  echo "Using feature_size=128 for large-v3/large-v3-turbo model"
+else
+  FEATURE_SIZE=80
+  echo "Using feature_size=80 for standard Whisper model"
+fi
+
 optimum-cli export executorch \
-  --model "openai/whisper-large-v3-turbo" \
+  --model "$MODEL_NAME" \
   --task "automatic-speech-recognition" \
   --recipe "metal" \
   --dtype bfloat16 \
   --output_dir "$ARTIFACT_DIR"
 
 python -m executorch.extension.audio.mel_spectrogram \
-  --feature_size 128 \
+  --feature_size $FEATURE_SIZE \
   --stack_output \
   --max_audio_len 300 \
   --output_file "$ARTIFACT_DIR"/whisper_preprocessor.pte
 
-TOKENIZER_URL="https://huggingface.co/openai/whisper-large-v3-turbo/resolve/main"
+TOKENIZER_URL="https://huggingface.co/$MODEL_NAME/resolve/main"
 
 curl -L $TOKENIZER_URL/tokenizer.json -o $ARTIFACT_DIR/tokenizer.json
 curl -L $TOKENIZER_URL/tokenizer_config.json -o $ARTIFACT_DIR/tokenizer_config.json
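Because the tokenizer files and the preprocessor `.pte` now depend on `$MODEL_NAME`, a small sanity check after export can confirm the per-model artifacts landed where `run.sh` expects them (a sketch, not part of `export.sh`; the model `.pte` written by `optimum-cli` is not checked because its file name is not fixed in this diff):

```bash
# Hypothetical post-export check for the files created by export.sh above.
ARTIFACT_DIR=~/Desktop/whisper
for f in whisper_preprocessor.pte tokenizer.json tokenizer_config.json; do
  if [ -f "$ARTIFACT_DIR/$f" ]; then
    echo "ok:      $f"
  else
    echo "missing: $f"
  fi
done
```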

metal/whisper/run.sh

Lines changed: 0 additions & 1 deletion
@@ -45,5 +45,4 @@ done
   --tokenizer_path "$ARTIFACT_DIR"/ \
   --audio_path "$INPUT_AUDIO_PATH" \
   --processor_path "$ARTIFACT_DIR"/whisper_preprocessor.pte \
-  --model_name "large-v3-turbo" \
   --temperature 0
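With the hardcoded `--model_name "large-v3-turbo"` flag removed, the same `run.sh` invocation works for any exported variant; only the artifact directory changes. For example (the `~/Desktop/whisper-small` directory below is hypothetical, assumed to have been produced by `export.sh ~/Desktop/whisper-small openai/whisper-small`):

```bash
# Transcribe with whichever Whisper variant was exported into the given directory.
/path/to/metal/whisper/run.sh ~/Desktop/audio.wav ~/Desktop/whisper-small
```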
