Add Voice Activity Detection and Speaker Diarization support #1398
base: main
Conversation
- Introduced VAD functionality to filter silent audio regions, improving transcription efficiency.
- Added speaker diarization capabilities using pyannote.audio, allowing identification of speakers in multi-speaker audio.
- Updated CLI and README to reflect new features and usage examples.
- Enhanced the transcribe function to support VAD and diarization options.
- Implemented RTTM format output for diarization results.

Signed-off-by: sealad886 <155285242+sealad886@users.noreply.github.com>
Pull request overview
This PR adds Voice Activity Detection (VAD) and speaker diarization capabilities to mlx-whisper. VAD filters silent audio regions before transcription to improve efficiency, while speaker diarization identifies who is speaking when in multi-speaker audio using pyannote.audio.
- VAD integration using Silero VAD with configurable options (threshold, silence duration, padding)
- Speaker diarization using pyannote.audio with speaker assignment to transcript segments and words
- RTTM format output support for diarization results
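For reference, RTTM SPEAKER records are single whitespace-delimited lines in the standard NIST layout; a minimal sketch of a formatter (the function name is hypothetical, not the PR's actual writer):

```python
def format_rttm_line(file_id, onset, duration, speaker):
    # NIST RTTM SPEAKER record: type, file id, channel, onset (s), duration (s),
    # two <NA> placeholders, speaker name, two trailing <NA> placeholders.
    return (f"SPEAKER {file_id} 1 {onset:.3f} {duration:.3f} "
            f"<NA> <NA> {speaker} <NA> <NA>")

line = format_rttm_line("meeting", 0.5, 2.25, "SPEAKER_00")
# "SPEAKER meeting 1 0.500 2.250 <NA> <NA> SPEAKER_00 <NA> <NA>"
```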
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| whisper/setup.py | Adds optional dependencies for VAD (torch) and diarization (pyannote.audio>=3.1, pandas, torch) |
| whisper/mlx_whisper/writers.py | Implements RTTM writer for diarization output and adds speaker labels to VTT/SRT subtitles |
| whisper/mlx_whisper/vad.py | New module implementing Silero VAD with speech detection, chunk extraction, and timestamp mapping |
| whisper/mlx_whisper/transcribe.py | Extends transcribe function with VAD preprocessing, adds transcribe_with_diarization function, implements segment deduplication and filtering |
| whisper/mlx_whisper/diarize.py | New module providing pyannote.audio integration for speaker diarization with speaker-to-segment assignment |
| whisper/mlx_whisper/cli.py | Adds CLI arguments for VAD (--vad-filter, thresholds) and diarization (--diarize, speaker counts, HF token) |
| whisper/mlx_whisper/__init__.py | Exports new transcribe_with_diarization function and optionally imports vad/diarize modules |
| whisper/README.md | Documents VAD and diarization features with usage examples and requirements |
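The speaker-to-segment assignment described for diarize.py is commonly done by maximum temporal overlap between each transcript segment and the diarization turns; a hedged sketch of that idea (names and data shapes are assumptions, not the PR's actual code):

```python
def assign_speakers(segments, turns):
    # segments: dicts with "start"/"end" times from transcription.
    # turns: (start, end, speaker) tuples from diarization.
    for seg in segments:
        best_speaker, best_overlap = None, 0.0
        for t_start, t_end, speaker in turns:
            # Overlap of [seg.start, seg.end] with [t_start, t_end]; negative if disjoint.
            overlap = min(seg["end"], t_end) - max(seg["start"], t_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        seg["speaker"] = best_speaker  # None if no turn overlaps this segment
    return segments
```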
```python
    Args:
        model_name: HuggingFace model ID for diarization
        use_auth_token: HuggingFace token (required for gated models)
```
**Copilot AI** commented on Dec 7, 2025
The parameter name in the docstring is `use_auth_token`, but the actual parameter is named `token`. Update the docstring to match the parameter name.
```diff
-        use_auth_token: HuggingFace token (required for gated models)
+        token: HuggingFace token (required for gated models)
```
```python
        # Use bisect_left: find the chunk this sample falls within
        chunk_idx = bisect.bisect_left(self.chunk_end_sample, sample)
```
**Copilot AI** commented on Dec 7, 2025
The use of `bisect_left` here may not correctly find the chunk that contains the sample. `bisect_left` returns the leftmost insertion point, so a sample exactly equal to a chunk's end sample is kept with the chunk that ends there rather than the chunk that starts there.

For example, if `chunk_end_sample = [1000, 2000, 3000]` and `sample = 1000`, `bisect_left` returns 0, mapping the sample to chunk 0; but if the end samples are exclusive, sample 1000 is the first sample of chunk 1 and should pick up chunk 1's silence offset.

Consider using `bisect_right` instead: it returns 0 for sample 999 and 1 for samples 1000 and 1001, handling the boundary consistently with exclusive chunk ends.
```diff
-        # Use bisect_left: find the chunk this sample falls within
-        chunk_idx = bisect.bisect_left(self.chunk_end_sample, sample)
+        # Use bisect_right: find the chunk this sample falls within
+        chunk_idx = bisect.bisect_right(self.chunk_end_sample, sample)
```
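The boundary behavior can be checked in isolation; a quick standalone comparison of the two bisect variants against a hypothetical `ends` list of exclusive chunk end samples:

```python
import bisect

ends = [1000, 2000, 3000]  # hypothetical exclusive end sample of each chunk

# A sample exactly on a boundary: bisect_left keeps it with the earlier chunk,
# bisect_right moves it to the chunk that starts at the boundary.
left_at_boundary = bisect.bisect_left(ends, 1000)    # 0
right_at_boundary = bisect.bisect_right(ends, 1000)  # 1

# Away from boundaries the two variants agree.
right_before = bisect.bisect_right(ends, 999)   # 0
right_after = bisect.bisect_right(ends, 1001)   # 1
```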
```python
    import pandas as pd

    _PANDAS_AVAILABLE = True
except ImportError:
    pd = None
```
**Copilot AI** commented on Dec 7, 2025
Import of 'pd' is not used.
```diff
-    import pandas as pd
+    import pandas
     _PANDAS_AVAILABLE = True
 except ImportError:
-    pd = None
+    pass
```
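The guarded-import pattern used for these optional dependencies can be factored into a small helper; a sketch under the assumption that callers only need presence/absence (the helper name is invented for illustration, not part of the PR):

```python
def optional_import(name):
    # Try to import a module by name; return (module, available) so callers
    # can branch on availability instead of catching ImportError everywhere.
    try:
        module = __import__(name)
        return module, True
    except ImportError:
        # The dependency is optional; record its absence and move on.
        return None, False

json_mod, has_json = optional_import("json")              # stdlib, always present
missing, has_missing = optional_import("no_such_mod_xyz")  # not installed
```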
```python
    from pyannote.audio import Pipeline

    _PYANNOTE_AVAILABLE = True
except ImportError:
    Pipeline = None
```
**Copilot AI** commented on Dec 7, 2025
Import of 'Pipeline' is not used.
```diff
-    from pyannote.audio import Pipeline
+    import pyannote.audio
     _PYANNOTE_AVAILABLE = True
 except ImportError:
-    Pipeline = None
+    pyannote = None
```
whisper/mlx_whisper/vad.py (outdated)
```python
# Graceful dependency handling
_TORCH_AVAILABLE = False
try:
    import torch
```
**Copilot AI** commented on Dec 7, 2025
Import of 'torch' is not used.
```python
    try:
        if hasattr(diarization_output, "exclusive_speaker_diarization") and len(ann) == 0:  # type: ignore[arg-type]
            ann = diarization_output.exclusive_speaker_diarization
    except Exception:
```
**Copilot AI** commented on Dec 7, 2025
'except' clause does nothing but pass and there is no explanatory comment.
```diff
     except Exception:
+        # Ignore all exceptions here: fallback to exclusive_speaker_diarization is optional,
+        # and if it fails, we simply return the original (possibly empty) annotation.
```
whisper/mlx_whisper/vad.py (outdated)
```python
    import torch

    _TORCH_AVAILABLE = True
except ImportError:
```
**Copilot AI** commented on Dec 7, 2025
'except' clause does nothing but pass and there is no explanatory comment.
```diff
 except ImportError:
+    # torch is optional; if not available, set _TORCH_AVAILABLE to False and continue.
```
Signed-off-by: sealad886 <155285242+sealad886@users.noreply.github.com>
- Add BeamSearchDecoder class ported from OpenAI whisper (torch to MLX)
- Support beam_size and patience parameters in DecodingOptions
- Update DecodingTask to use BeamSearchDecoder when beam_size is provided
- Handle different return types between GreedyDecoder (mx.array) and BeamSearchDecoder (List[List[List[int]]]) in run()
- Remove _sanitize_decoding_options() workaround from transcribe.py
- Add comprehensive unit tests for beam search decoder
- Update README with beam search CLI and API documentation
- Add .gitignore to exclude test artifacts

The patience parameter controls early stopping via max_candidates = round(beam_size * patience)
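The early-stopping rule quoted above can be checked in isolation (the function name is illustrative, not the PR's API):

```python
def max_candidates(beam_size, patience=None):
    # patience defaults to 1.0, i.e. plain beam search with no extra slack:
    # decoding stops once beam_size finished candidates have been collected.
    effective_patience = patience if patience is not None else 1.0
    return round(beam_size * effective_patience)

default_cap = max_candidates(5)       # 5  (patience omitted -> 1.0)
relaxed_cap = max_candidates(5, 1.2)  # 6  (round(5 * 1.2))
```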
…VAD and diarization Signed-off-by: sealad886 <155285242+sealad886@users.noreply.github.com>
Introduce Voice Activity Detection (VAD) to enhance transcription efficiency by filtering silent audio regions. Implement speaker diarization using pyannote.audio for identifying speakers in multi-speaker audio. Update CLI and documentation to reflect these new features and provide usage examples. Enhance the transcribe function to support both VAD and diarization options, and implement RTTM format output for diarization results.