Add Voice Activity Detection and Speaker Diarization support #1398
base: main
Conversation
- Introduced VAD functionality to filter silent audio regions, improving transcription efficiency.
- Added speaker diarization capabilities using pyannote.audio, allowing identification of speakers in multi-speaker audio.
- Updated CLI and README to reflect new features and usage examples.
- Enhanced the transcribe function to support VAD and diarization options.
- Implemented RTTM format output for diarization results.

Signed-off-by: sealad886 <155285242+sealad886@users.noreply.github.com>
Pull request overview
This PR adds Voice Activity Detection (VAD) and speaker diarization capabilities to mlx-whisper. VAD filters silent audio regions before transcription to improve efficiency, while speaker diarization identifies who is speaking when in multi-speaker audio using pyannote.audio.
- VAD integration using Silero VAD with configurable options (threshold, silence duration, padding)
- Speaker diarization using pyannote.audio with speaker assignment to transcript segments and words
- RTTM format output support for diarization results
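For reference, RTTM SPEAKER records are single whitespace-delimited lines in the standard NIST layout; a minimal sketch of a formatter (the function name is hypothetical, not the PR's actual writer):

```python
def format_rttm_line(file_id, onset, duration, speaker):
    # NIST RTTM SPEAKER record: type, file id, channel, onset (s), duration (s),
    # two <NA> placeholders, speaker name, two trailing <NA> placeholders.
    return (f"SPEAKER {file_id} 1 {onset:.3f} {duration:.3f} "
            f"<NA> <NA> {speaker} <NA> <NA>")

line = format_rttm_line("meeting", 0.5, 2.25, "SPEAKER_00")
# "SPEAKER meeting 1 0.500 2.250 <NA> <NA> SPEAKER_00 <NA> <NA>"
```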
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| whisper/setup.py | Adds optional dependencies for VAD (torch) and diarization (pyannote.audio>=3.1, pandas, torch) |
| whisper/mlx_whisper/writers.py | Implements RTTM writer for diarization output and adds speaker labels to VTT/SRT subtitles |
| whisper/mlx_whisper/vad.py | New module implementing Silero VAD with speech detection, chunk extraction, and timestamp mapping |
| whisper/mlx_whisper/transcribe.py | Extends transcribe function with VAD preprocessing, adds transcribe_with_diarization function, implements segment deduplication and filtering |
| whisper/mlx_whisper/diarize.py | New module providing pyannote.audio integration for speaker diarization with speaker-to-segment assignment |
| whisper/mlx_whisper/cli.py | Adds CLI arguments for VAD (--vad-filter, thresholds) and diarization (--diarize, speaker counts, HF token) |
| whisper/mlx_whisper/__init__.py | Exports new transcribe_with_diarization function and optionally imports vad/diarize modules |
| whisper/README.md | Documents VAD and diarization features with usage examples and requirements |
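The speaker-to-segment assignment described for diarize.py is commonly done by maximum temporal overlap between each transcript segment and the diarization turns; a hedged sketch of that idea (names and data shapes are assumptions, not the PR's actual code):

```python
def assign_speakers(segments, turns):
    # segments: dicts with "start"/"end" times from transcription.
    # turns: (start, end, speaker) tuples from diarization.
    for seg in segments:
        best_speaker, best_overlap = None, 0.0
        for t_start, t_end, speaker in turns:
            # Overlap of [seg.start, seg.end] with [t_start, t_end]; negative if disjoint.
            overlap = min(seg["end"], t_end) - max(seg["start"], t_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        seg["speaker"] = best_speaker  # None if no turn overlaps this segment
    return segments
```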
```python
    Args:
        model_name: HuggingFace model ID for diarization
        use_auth_token: HuggingFace token (required for gated models)
```
**Copilot AI** commented on Dec 7, 2025
The parameter name in the docstring is `use_auth_token`, but the actual parameter is named `token`. Update the docstring to match the parameter name.
```diff
-        use_auth_token: HuggingFace token (required for gated models)
+        token: HuggingFace token (required for gated models)
```
```python
        # Use bisect_left: find the chunk this sample falls within
        chunk_idx = bisect.bisect_left(self.chunk_end_sample, sample)
```
**Copilot AI** commented on Dec 7, 2025
The use of `bisect_left` here may not correctly find the chunk that contains the sample. `bisect_left` returns the leftmost insertion point, so a sample exactly equal to a chunk's end sample is kept with the chunk that ends there rather than the chunk that starts there.

For example, if `chunk_end_sample = [1000, 2000, 3000]` and `sample = 1000`, `bisect_left` returns 0, mapping the sample to chunk 0; but if the end samples are exclusive, sample 1000 is the first sample of chunk 1 and should pick up chunk 1's silence offset.

Consider using `bisect_right` instead: it returns 0 for sample 999 and 1 for samples 1000 and 1001, handling the boundary consistently with exclusive chunk ends.
```diff
-        # Use bisect_left: find the chunk this sample falls within
-        chunk_idx = bisect.bisect_left(self.chunk_end_sample, sample)
+        # Use bisect_right: find the chunk this sample falls within
+        chunk_idx = bisect.bisect_right(self.chunk_end_sample, sample)
```
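The boundary behavior can be checked in isolation; a quick standalone comparison of the two bisect variants against a hypothetical `ends` list of exclusive chunk end samples:

```python
import bisect

ends = [1000, 2000, 3000]  # hypothetical exclusive end sample of each chunk

# A sample exactly on a boundary: bisect_left keeps it with the earlier chunk,
# bisect_right moves it to the chunk that starts at the boundary.
left_at_boundary = bisect.bisect_left(ends, 1000)    # 0
right_at_boundary = bisect.bisect_right(ends, 1000)  # 1

# Away from boundaries the two variants agree.
right_before = bisect.bisect_right(ends, 999)   # 0
right_after = bisect.bisect_right(ends, 1001)   # 1
```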
```python
    import pandas as pd

    _PANDAS_AVAILABLE = True
except ImportError:
    pd = None
```
**Copilot AI** commented on Dec 7, 2025
Import of 'pd' is not used.
```diff
-    import pandas as pd
+    import pandas
     _PANDAS_AVAILABLE = True
 except ImportError:
-    pd = None
+    pass
```
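The guarded-import pattern used for these optional dependencies can be factored into a small helper; a sketch under the assumption that callers only need presence/absence (the helper name is invented for illustration, not part of the PR):

```python
def optional_import(name):
    # Try to import a module by name; return (module, available) so callers
    # can branch on availability instead of catching ImportError everywhere.
    try:
        module = __import__(name)
        return module, True
    except ImportError:
        # The dependency is optional; record its absence and move on.
        return None, False

json_mod, has_json = optional_import("json")              # stdlib, always present
missing, has_missing = optional_import("no_such_mod_xyz")  # not installed
```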
```python
    from pyannote.audio import Pipeline

    _PYANNOTE_AVAILABLE = True
except ImportError:
    Pipeline = None
```
**Copilot AI** commented on Dec 7, 2025
Import of 'Pipeline' is not used.
```diff
-    from pyannote.audio import Pipeline
+    import pyannote.audio
     _PYANNOTE_AVAILABLE = True
 except ImportError:
-    Pipeline = None
+    pyannote = None
```
whisper/mlx_whisper/vad.py (outdated)
```python
# Graceful dependency handling
_TORCH_AVAILABLE = False
try:
    import torch
```
**Copilot AI** commented on Dec 7, 2025
Import of 'torch' is not used.
```python
    try:
        if hasattr(diarization_output, "exclusive_speaker_diarization") and len(ann) == 0:  # type: ignore[arg-type]
            ann = diarization_output.exclusive_speaker_diarization
    except Exception:
```
**Copilot AI** commented on Dec 7, 2025
'except' clause does nothing but pass and there is no explanatory comment.
```diff
     except Exception:
+        # Ignore all exceptions here: fallback to exclusive_speaker_diarization is optional,
+        # and if it fails, we simply return the original (possibly empty) annotation.
```
whisper/mlx_whisper/vad.py (outdated)
```python
    import torch

    _TORCH_AVAILABLE = True
except ImportError:
```
**Copilot AI** commented on Dec 7, 2025
'except' clause does nothing but pass and there is no explanatory comment.
```diff
 except ImportError:
+    # torch is optional; if not available, set _TORCH_AVAILABLE to False and continue.
```
Signed-off-by: sealad886 <155285242+sealad886@users.noreply.github.com>
- Add BeamSearchDecoder class ported from OpenAI whisper (torch to MLX)
- Support beam_size and patience parameters in DecodingOptions
- Update DecodingTask to use BeamSearchDecoder when beam_size is provided
- Handle different return types between GreedyDecoder (mx.array) and BeamSearchDecoder (List[List[List[int]]]) in run()
- Remove _sanitize_decoding_options() workaround from transcribe.py
- Add comprehensive unit tests for beam search decoder
- Update README with beam search CLI and API documentation
- Add .gitignore to exclude test artifacts

The patience parameter controls early stopping via max_candidates = round(beam_size * patience)
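The early-stopping rule quoted above can be checked in isolation (the function name is illustrative, not the PR's API):

```python
def max_candidates(beam_size, patience=None):
    # patience defaults to 1.0, i.e. plain beam search with no extra slack:
    # decoding stops once beam_size finished candidates have been collected.
    effective_patience = patience if patience is not None else 1.0
    return round(beam_size * effective_patience)

default_cap = max_candidates(5)       # 5  (patience omitted -> 1.0)
relaxed_cap = max_candidates(5, 1.2)  # 6  (round(5 * 1.2))
```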
…VAD and diarization Signed-off-by: sealad886 <155285242+sealad886@users.noreply.github.com>
Introduce Voice Activity Detection (VAD) to enhance transcription efficiency by filtering silent audio regions. Implement speaker diarization using pyannote.audio for identifying speakers in multi-speaker audio. Update CLI and documentation to reflect these new features and provide usage examples. Enhance the transcribe function to support both VAD and diarization options, and implement RTTM format output for diarization results.