Conversation

@sealad886
Contributor

Introduce Voice Activity Detection (VAD) to enhance transcription efficiency by filtering silent audio regions. Implement speaker diarization using pyannote.audio for identifying speakers in multi-speaker audio. Update CLI and documentation to reflect these new features and provide usage examples. Enhance the transcribe function to support both VAD and diarization options, and implement RTTM format output for diarization results.

- Introduced VAD functionality to filter silent audio regions, improving transcription efficiency.
- Added speaker diarization capabilities using pyannote.audio, allowing identification of speakers in multi-speaker audio.
- Updated CLI and README to reflect new features and usage examples.
- Enhanced transcribe function to support VAD and diarization options.
- Implemented RTTM format output for diarization results.

Signed-off-by: sealad886 <155285242+sealad886@users.noreply.github.com>
Copilot AI review requested due to automatic review settings December 7, 2025 04:20
Copilot AI left a comment

Pull request overview

This PR adds Voice Activity Detection (VAD) and speaker diarization capabilities to mlx-whisper. VAD filters silent audio regions before transcription to improve efficiency, while speaker diarization identifies who is speaking when in multi-speaker audio using pyannote.audio.

  • VAD integration using Silero VAD with configurable options (threshold, silence duration, padding)
  • Speaker diarization using pyannote.audio with speaker assignment to transcript segments and words
  • RTTM format output support for diarization results
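As background for that last bullet: a diarization turn is conventionally serialized as one SPEAKER line in the NIST RTTM format. A minimal sketch of such a writer follows (the function name and the segment dicts are illustrative, not the PR's actual API):

```python
def to_rttm_line(file_id: str, start: float, duration: float, speaker: str) -> str:
    # RTTM speaker turn: SPEAKER <file-id> <channel> <start> <duration>
    # <NA> <NA> <speaker-label> <NA> <NA>
    return (
        f"SPEAKER {file_id} 1 {start:.3f} {duration:.3f} "
        f"<NA> <NA> {speaker} <NA> <NA>"
    )

# Hypothetical diarized segments with start/end times in seconds.
segments = [
    {"start": 0.0, "end": 2.5, "speaker": "SPEAKER_00"},
    {"start": 2.5, "end": 4.1, "speaker": "SPEAKER_01"},
]
for seg in segments:
    print(to_rttm_line("audio", seg["start"], seg["end"] - seg["start"], seg["speaker"]))
```

One line per contiguous speaker turn; downstream scoring tools such as dscore consume this layout directly.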

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 7 comments.

  • whisper/setup.py: Adds optional dependencies for VAD (torch) and diarization (pyannote.audio>=3.1, pandas, torch)
  • whisper/mlx_whisper/writers.py: Implements RTTM writer for diarization output and adds speaker labels to VTT/SRT subtitles
  • whisper/mlx_whisper/vad.py: New module implementing Silero VAD with speech detection, chunk extraction, and timestamp mapping
  • whisper/mlx_whisper/transcribe.py: Extends transcribe function with VAD preprocessing, adds transcribe_with_diarization function, implements segment deduplication and filtering
  • whisper/mlx_whisper/diarize.py: New module providing pyannote.audio integration for speaker diarization with speaker-to-segment assignment
  • whisper/mlx_whisper/cli.py: Adds CLI arguments for VAD (--vad-filter, thresholds) and diarization (--diarize, speaker counts, HF token)
  • whisper/mlx_whisper/__init__.py: Exports new transcribe_with_diarization function and optionally imports vad/diarize modules
  • whisper/README.md: Documents VAD and diarization features with usage examples and requirements

Args:
    model_name: HuggingFace model ID for diarization
    use_auth_token: HuggingFace token (required for gated models)
Copilot AI Dec 7, 2025

The parameter name in the docstring is use_auth_token, but the actual parameter is named token. Update the docstring to match the parameter name.

Suggested change
- use_auth_token: HuggingFace token (required for gated models)
+ token: HuggingFace token (required for gated models)

Comment on lines +188 to +189
# Use bisect_left: find the chunk this sample falls within
chunk_idx = bisect.bisect_left(self.chunk_end_sample, sample)
Copilot AI Dec 7, 2025

The use of bisect_left here may not correctly find the chunk that contains the sample: bisect_left returns the leftmost insertion point, so a sample that falls exactly on a recorded chunk boundary is mapped to the index before that boundary.

For example, if chunk_end_sample = [1000, 2000, 3000], bisect_left returns 0 for sample = 1000 but 1 for sample = 1001. If chunk_end_sample stores exclusive end offsets (chunk 0 covers samples 0-999), the boundary sample 1000 actually belongs to chunk 1, and bisect_left would apply the wrong chunk's silence offset at every boundary.

Consider using bisect_right instead, which maps the boundary sample 1000 to chunk 1 (off-boundary samples such as 1001 are unaffected), or document whether chunk_end_sample values are inclusive or exclusive and choose the matching bisect variant.

Suggested change
- # Use bisect_left: find the chunk this sample falls within
- chunk_idx = bisect.bisect_left(self.chunk_end_sample, sample)
+ # Use bisect_right: find the chunk this sample falls within
+ chunk_idx = bisect.bisect_right(self.chunk_end_sample, sample)
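The boundary behavior at issue can be checked directly with a standalone snippet (independent of the PR's code):

```python
import bisect

chunk_end_sample = [1000, 2000, 3000]

# On an exact boundary, the two variants disagree: bisect_left keeps the
# boundary value with the earlier index, bisect_right moves it to the later one.
print(bisect.bisect_left(chunk_end_sample, 1000))   # 0
print(bisect.bisect_right(chunk_end_sample, 1000))  # 1

# Off-boundary samples are mapped identically by both variants.
print(bisect.bisect_left(chunk_end_sample, 1001))   # 1
print(bisect.bisect_right(chunk_end_sample, 1001))  # 1
```

So the choice only matters for samples landing exactly on a stored end value, which is precisely where an off-by-one in the silence offset would appear.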

Comment on lines +19 to +24
    import pandas as pd

    _PANDAS_AVAILABLE = True
except ImportError:
    pd = None
Copilot AI Dec 7, 2025

Import of 'pd' is not used.

Suggested change
- import pandas as pd
- _PANDAS_AVAILABLE = True
- except ImportError:
- pd = None
+ import pandas
+ _PANDAS_AVAILABLE = True
+ except ImportError:
+ pass

Comment on lines +26 to +30
    from pyannote.audio import Pipeline

    _PYANNOTE_AVAILABLE = True
except ImportError:
    Pipeline = None
Copilot AI Dec 7, 2025

Import of 'Pipeline' is not used.

Suggested change
- from pyannote.audio import Pipeline
- _PYANNOTE_AVAILABLE = True
- except ImportError:
- Pipeline = None
+ import pyannote.audio
+ _PYANNOTE_AVAILABLE = True
+ except ImportError:
+ pyannote = None

# Graceful dependency handling
_TORCH_AVAILABLE = False
try:
    import torch
Copilot AI Dec 7, 2025

Import of 'torch' is not used.

try:
    if hasattr(diarization_output, "exclusive_speaker_diarization") and len(ann) == 0:  # type: ignore[arg-type]
        ann = diarization_output.exclusive_speaker_diarization
except Exception:
Copilot AI Dec 7, 2025

'except' clause does nothing but pass and there is no explanatory comment.

Suggested change
- except Exception:
+ except Exception:
+     # Ignore all exceptions here: fallback to exclusive_speaker_diarization is optional,
+     # and if it fails, we simply return the original (possibly empty) annotation.

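A pattern that makes such an intentionally ignored failure explicit, without a bare except/pass, is contextlib.suppress. A generic sketch (best_effort_refine is a hypothetical wrapper mirroring the quoted snippet, not the PR's code):

```python
import contextlib


def best_effort_refine(annotation, diarization_output):
    # The refinement is optional: any failure here leaves `annotation`
    # unchanged, which contextlib.suppress states explicitly.
    with contextlib.suppress(Exception):
        if hasattr(diarization_output, "exclusive_speaker_diarization") and len(annotation) == 0:
            annotation = diarization_output.exclusive_speaker_diarization
    return annotation
```

The suppress context manager reads as "errors inside this block are expected and deliberately discarded", which is the explanatory signal the review asks for.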
    import torch

    _TORCH_AVAILABLE = True
except ImportError:
Copilot AI Dec 7, 2025

'except' clause does nothing but pass and there is no explanatory comment.

Suggested change
- except ImportError:
+ except ImportError:
+     # torch is optional; if not available, set _TORCH_AVAILABLE to False and continue.

Signed-off-by: sealad886 <155285242+sealad886@users.noreply.github.com>
- Add BeamSearchDecoder class ported from OpenAI whisper (torch to MLX)
- Support beam_size and patience parameters in DecodingOptions
- Update DecodingTask to use BeamSearchDecoder when beam_size is provided
- Handle different return types between GreedyDecoder (mx.array) and
  BeamSearchDecoder (List[List[List[int]]]) in run()
- Remove _sanitize_decoding_options() workaround from transcribe.py
- Add comprehensive unit tests for beam search decoder
- Update README with beam search CLI and API documentation
- Add .gitignore to exclude test artifacts

The patience parameter controls early stopping via
max_candidates = round(beam_size * patience)
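Worked numbers for that formula, as a standalone check (max_candidates here is just the arithmetic from the commit message, not the decoder's internal name):

```python
def max_candidates(beam_size: int, patience: float) -> int:
    # patience = 1.0 reproduces plain beam search; larger values keep
    # more finished candidates alive before stopping early.
    return round(beam_size * patience)


print(max_candidates(5, 1.0))  # 5
print(max_candidates(5, 2.0))  # 10
print(max_candidates(4, 1.5))  # 6
```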
…VAD and diarization

Signed-off-by: sealad886 <155285242+sealad886@users.noreply.github.com>
