
Commit 58e8f3a

Support various Whisper models with Metal backend (#113)
* fix: rm model_name, executorch #15798
* feat: add model_name param to support various whisper models
* feat: support standard model's FEATURE_SIZE with 80
* docs: update model_name requirement, different FEATURE_SIZE and various model support
1 parent e6de006 commit 58e8f3a

File tree

- metal/whisper/README.md
- metal/whisper/e2e.sh
- metal/whisper/export.sh
- metal/whisper/run.sh

4 files changed: +124 −16 lines


metal/whisper/README.md

Lines changed: 91 additions & 10 deletions
@@ -21,6 +21,7 @@ The Metal backend scripts are located under `metal/`
 ## Whisper Example
 
 The Whisper example demonstrates how to:
+
 1. Set up your environment
 2. Export the Whisper model to ExecuTorch format
 3. Build the Whisper Metal runner
@@ -37,7 +38,8 @@ metal/whisper/e2e.sh \
   --setup-env \
   --export \
   --build \
-  --run --audio-path <audio_path>
+  --run --audio-path <audio_path> \
+  [--model-name <model_name>]
 ```
 
 **Required Arguments:**
@@ -46,7 +48,13 @@ metal/whisper/e2e.sh \
 - `<env_name>` - Name of the conda environment to create (e.g., `whisper-example`)
 - `<audio_path>` - Path to your audio file for inference (e.g., `~/Desktop/audio.wav`)
 
-**Example:**
+**Optional Arguments:**
+
+- `<model_name>` - HuggingFace Whisper model name (default: `openai/whisper-large-v3-turbo`)
+  - Available models: `openai/whisper-tiny`, `openai/whisper-base`, `openai/whisper-small`,
+    `openai/whisper-medium`, `openai/whisper-large`, `openai/whisper-large-v3-turbo`
+
+**Example (default large-v3-turbo model):**
 
 ```bash
 metal/whisper/e2e.sh \
@@ -60,11 +68,29 @@ metal/whisper/e2e.sh \
   --run --audio-path ~/Desktop/audio.wav
 ```
 
+**Example (using small model):**
+
+```bash
+metal/whisper/e2e.sh \
+  --artifact-dir ~/Desktop/whisper \
+  --env-name whisper-example \
+  --clone-et \
+  --create-env \
+  --setup-env \
+  --export \
+  --model-name openai/whisper-small \
+  --build \
+  --run --audio-path ~/Desktop/audio.wav
+```
+
 This will automatically:
+
 1. Setup the environment:
-- Clone the executorch repo
-- Create a conda environment named `whisper-example`
-- Install all dependencies
+
+   - Clone the executorch repo
+   - Create a conda environment named `whisper-example`
+   - Install all dependencies
+
 2. Export the Whisper model to the `~/Desktop/whisper` directory
 3. Build the Whisper Metal runner
 4. Run inference on `~/Desktop/whisper/audio.wav`
@@ -86,14 +112,29 @@ conda activate <env_name>
 #### Step 2: Export the Model
 
 ```bash
-/path/to/metal/whisper/export.sh <artifact_dir>
+/path/to/metal/whisper/export.sh <artifact_dir> [model_name]
 ```
 
 **Arguments:**
-- `<artifact_dir>` - Directory to store exported model files (e.g., `~/Desktop/whisper`)
+
+- `<artifact_dir>` - Directory to store exported model files (required, e.g., `~/Desktop/whisper`)
+- `[model_name]` - HuggingFace Whisper model name (optional, default: `openai/whisper-large-v3-turbo`)
+  - Available models: `openai/whisper-tiny`, `openai/whisper-base`, `openai/whisper-small`,
+    `openai/whisper-medium`, `openai/whisper-large`, `openai/whisper-large-v3-turbo`
+
+**Examples:**
+
+```bash
+# Export default large-v3-turbo model
+/path/to/metal/whisper/export.sh ~/Desktop/whisper
+
+# Export small model
+/path/to/metal/whisper/export.sh ~/Desktop/whisper openai/whisper-small
+```
 
 This will:
-- Download the Whisper model
+
+- Download the specified Whisper model from HuggingFace
 - Export it to ExecuTorch format with Metal optimizations
 - Save model files (`.pte`), metadata, and preprocessor to the specified directory
 
@@ -110,11 +151,51 @@ This will:
 ```
 
 **Arguments:**
-- `<audio_path>` - Path to your audio file (e.g., `/path/to/audio.wav`)
-- `<artifact_dir>` - Directory containing exported model files (e.g., `~/Desktop/whisper`)
+
+- `<audio_path>` - Path to your audio file (required, e.g., `/path/to/audio.wav`)
+- `<artifact_dir>` - Directory containing exported model files (required, e.g., `~/Desktop/whisper`)
+
+**Example:**
+
+```bash
+/path/to/metal/whisper/run.sh ~/Desktop/audio.wav ~/Desktop/whisper
+```
 
 This will:
+
 - Validate that all required model files exist
 - Load the model and preprocessor
 - Run inference on the provided audio
 - Display timing information
+
+## Available Whisper Models
+
+The following Whisper models are supported:
+
+| Model Name     | HuggingFace ID (for export)     | Parameters | Mel Features | Relative Speed | Use Case                       |
+| -------------- | ------------------------------- | ---------- | ------------ | -------------- | ------------------------------ |
+| Tiny           | `openai/whisper-tiny`           | 39M        | 80           | Fastest        | Quick transcription, real-time |
+| Base           | `openai/whisper-base`           | 74M        | 80           | Very Fast      | Good balance for real-time     |
+| Small          | `openai/whisper-small`          | 244M       | 80           | Fast           | Recommended for most use cases |
+| Medium         | `openai/whisper-medium`         | 769M       | 80           | Moderate       | Higher accuracy needed         |
+| Large          | `openai/whisper-large`          | 1550M      | 80           | Slower         | Best accuracy                  |
+| Large V3       | `openai/whisper-large-v3`       | 1550M      | **128**      | Slower         | Latest architecture            |
+| Large V3 Turbo | `openai/whisper-large-v3-turbo` | 809M       | **128**      | Fast           | Default, good balance          |
+
+### Mel Features Configuration
+
+The export scripts automatically configure the correct mel feature size based on the model:
+
+- **80 mel features**: Used by all standard models (tiny, base, small, medium, large, large-v2)
+- **128 mel features**: Used only by large-v3 and large-v3-turbo variants
+
+**Important:** The preprocessor must match the model's expected feature size, or you'll encounter tensor shape mismatch errors. The export scripts handle this automatically.
+
+### Tokenizer Configuration
+
+**Important Note:** All Whisper models downloaded from HuggingFace now use the updated tokenizer format where:
+
+- Token `50257` = `<|endoftext|>`
+- Token `50258` = `<|startoftranscript|>` (used as `decoder_start_token_id`)
+
+The whisper_runner automatically uses `decoder_start_token_id=50258` for all models, so you don't need to worry about tokenizer compatibility when exporting and running any Whisper variant.
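The new "Mel Features Configuration" and "Tokenizer Configuration" sections above describe values that come straight from each model's HuggingFace `config.json`. As a quick cross-check before exporting (a sketch, not part of this commit; it assumes `curl` and `python3` are available and that the model repo publishes `num_mel_bins` and `decoder_start_token_id` in `config.json`, as the `openai/whisper-*` repos do):

```bash
# Hypothetical pre-export check: read the mel feature size and decoder start
# token directly from the model's config.json on HuggingFace.
MODEL_NAME="openai/whisper-small"
curl -sL "https://huggingface.co/$MODEL_NAME/resolve/main/config.json" \
  | python3 -c 'import json,sys; c=json.load(sys.stdin); print("num_mel_bins:", c.get("num_mel_bins"), "| decoder_start_token_id:", c.get("decoder_start_token_id"))'
```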

metal/whisper/e2e.sh

Lines changed: 9 additions & 1 deletion
@@ -18,6 +18,7 @@ EXECUTORCH_PATH=""
 ARTIFACT_DIR=""
 ENV_NAME=""
 AUDIO_PATH=""
+MODEL_NAME="openai/whisper-large-v3-turbo"
 
 echo "Current script path: $(realpath "$0")"
 SCRIPT_DIR="$(realpath "$(dirname "$(realpath "$0")")")"
@@ -37,13 +38,15 @@ usage() {
   echo "  --create-env          Create the Python environment"
   echo "  --setup-env           Set up the Python environment"
   echo "  --export              Export the Whisper model"
+  echo "  --model-name NAME     HuggingFace model name (optional, default: openai/whisper-large-v3-turbo)"
   echo "  --build               Build the Whisper runner"
   echo "  --audio-path PATH     Path to the input audio file"
   echo "  --run                 Run the Whisper model"
   echo "  -h, --help            Display this help message"
   echo ""
   echo "Example:"
   echo "  $0 --env-name metal-backend --setup-env --export --build --audio-path audio.wav --run"
+  echo "  $0 --env-name metal-backend --export --model-name openai/whisper-small --build --audio-path audio.wav --run"
   exit 1
 }
 
@@ -78,6 +81,10 @@ while [[ $# -gt 0 ]]; do
       EXPORT=true
       shift
       ;;
+    --model-name)
+      MODEL_NAME="$2"
+      shift 2
+      ;;
     --build)
       BUILD=true
       shift
@@ -160,8 +167,9 @@ fi
 # Execute export
 if [ "$EXPORT" = true ]; then
   echo "Exporting Whisper model to $ARTIFACT_DIR ..."
+  echo "  - Model: $MODEL_NAME"
   echo "  - Script: $SCRIPT_DIR/export.sh"
-  conda run -n "$ENV_NAME" "$SCRIPT_DIR/export.sh" "$ARTIFACT_DIR"
+  conda run -n "$ENV_NAME" "$SCRIPT_DIR/export.sh" "$ARTIFACT_DIR" "$MODEL_NAME"
 fi
 
 # Execute build
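With `--model-name` now threaded through `e2e.sh` into `export.sh`, a plausible follow-up run that swaps models without repeating environment setup could look like the sketch below (an assumption: the conda environment and executorch checkout already exist from a previous full run, so `--clone-et`, `--create-env`, and `--setup-env` are omitted):

```bash
# Hypothetical second run: reuse the existing environment, re-export a
# different model, rebuild the runner, and transcribe the same audio file.
metal/whisper/e2e.sh \
  --artifact-dir ~/Desktop/whisper \
  --env-name whisper-example \
  --export --model-name openai/whisper-base \
  --build \
  --run --audio-path ~/Desktop/audio.wav
```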

metal/whisper/export.sh

Lines changed: 24 additions & 4 deletions
@@ -6,29 +6,49 @@
 # LICENSE file in the root directory of this source tree.
 
 ARTIFACT_DIR="$1"
+MODEL_NAME="${2:-openai/whisper-large-v3-turbo}"
 
 if [ -z "$ARTIFACT_DIR" ]; then
   echo "Error: ARTIFACT_DIR argument not provided."
-  echo "Usage: $0 <ARTIFACT_DIR>"
+  echo "Usage: $0 <ARTIFACT_DIR> [MODEL_NAME]"
+  echo ""
+  echo "Arguments:"
+  echo "  <ARTIFACT_DIR>   Directory to store exported model files (required)"
+  echo "  [MODEL_NAME]     HuggingFace model name (optional, default: openai/whisper-large-v3-turbo)"
+  echo ""
+  echo "Example:"
+  echo "  $0 ~/Desktop/whisper openai/whisper-small"
   exit 1
 fi
 
 mkdir -p "$ARTIFACT_DIR"
 
+echo "Exporting model: $MODEL_NAME"
+
+# Determine feature_size based on model name
+# large-v3 and large-v3-turbo use 128 mel features, all others use 80
+if [[ "$MODEL_NAME" == *"large-v3"* ]]; then
+  FEATURE_SIZE=128
+  echo "Using feature_size=128 for large-v3/large-v3-turbo model"
+else
+  FEATURE_SIZE=80
+  echo "Using feature_size=80 for standard Whisper model"
+fi
+
 optimum-cli export executorch \
-  --model "openai/whisper-large-v3-turbo" \
+  --model "$MODEL_NAME" \
   --task "automatic-speech-recognition" \
   --recipe "metal" \
   --dtype bfloat16 \
   --output_dir "$ARTIFACT_DIR"
 
 python -m executorch.extension.audio.mel_spectrogram \
-  --feature_size 128 \
+  --feature_size $FEATURE_SIZE \
   --stack_output \
   --max_audio_len 300 \
   --output_file "$ARTIFACT_DIR"/whisper_preprocessor.pte
 
-TOKENIZER_URL="https://huggingface.co/openai/whisper-large-v3-turbo/resolve/main"
+TOKENIZER_URL="https://huggingface.co/$MODEL_NAME/resolve/main"
 
 curl -L $TOKENIZER_URL/tokenizer.json -o $ARTIFACT_DIR/tokenizer.json
 curl -L $TOKENIZER_URL/tokenizer_config.json -o $ARTIFACT_DIR/tokenizer_config.json
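Because the tokenizer files and the preprocessor `.pte` now depend on `$MODEL_NAME`, a small sanity check after export can confirm the per-model artifacts landed where `run.sh` expects them (a sketch, not part of `export.sh`; the model `.pte` written by `optimum-cli` is not checked because its file name is not fixed in this diff):

```bash
# Hypothetical post-export check for the files created by export.sh above.
ARTIFACT_DIR=~/Desktop/whisper
for f in whisper_preprocessor.pte tokenizer.json tokenizer_config.json; do
  if [ -f "$ARTIFACT_DIR/$f" ]; then
    echo "ok:      $f"
  else
    echo "missing: $f"
  fi
done
```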

metal/whisper/run.sh

Lines changed: 0 additions & 1 deletion
@@ -45,5 +45,4 @@ done
   --tokenizer_path "$ARTIFACT_DIR"/ \
   --audio_path "$INPUT_AUDIO_PATH" \
   --processor_path "$ARTIFACT_DIR"/whisper_preprocessor.pte \
-  --model_name "large-v3-turbo" \
   --temperature 0
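With the hardcoded `--model_name "large-v3-turbo"` flag removed, the same `run.sh` invocation works for any exported variant; only the artifact directory changes. For example (the `~/Desktop/whisper-small` directory below is hypothetical, assumed to have been produced by `export.sh ~/Desktop/whisper-small openai/whisper-small`):

```bash
# Transcribe with whichever Whisper variant was exported into the given directory.
/path/to/metal/whisper/run.sh ~/Desktop/audio.wav ~/Desktop/whisper-small
```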
