feat(transcribe): add speaker diarization support#188
Open
basnijholt wants to merge 34 commits into main from feat/speaker-diarization
Conversation
Add speaker diarization as a post-processing step for transcription using pyannote-audio. This identifies and labels different speakers in the transcript, useful for meetings, interviews, or multi-speaker audio.

Features:
- New `--diarize` flag to enable speaker diarization
- `--diarize-format` option for inline (default) or JSON output
- `--hf-token` for HuggingFace authentication (required for pyannote models)
- `--min-speakers` and `--max-speakers` hints for improved accuracy
- Works with any ASR provider (Wyoming, OpenAI, Gemini)
- New optional dependency: `pip install agent-cli[diarization]`

Output formats:
- Inline: `[SPEAKER_00]: Hello, how are you?`
- JSON: structured with speaker, timestamps, and text
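For orientation, here is a minimal sketch of how the two output formats could hang off the `DiarizedSegment` dataclass and `format_diarized_output` helper listed in the test plan below. The field names and signature are assumptions for illustration, not the code in this PR.

```python
from __future__ import annotations

import json
from dataclasses import dataclass


@dataclass
class DiarizedSegment:
    speaker: str  # e.g. "SPEAKER_00"
    start: float  # seconds
    end: float    # seconds
    text: str


def format_diarized_output(segments: list[DiarizedSegment], fmt: str = "inline") -> str:
    if fmt == "json":
        payload = {
            "segments": [
                {"speaker": s.speaker, "start": s.start, "end": s.end, "text": s.text}
                for s in segments
            ],
        }
        return json.dumps(payload, indent=2)
    # Inline: one "[SPEAKER_XX]: text" line per segment.
    return "\n".join(f"[{s.speaker}]: {s.text}" for s in segments)
```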
# Conflicts:
#   agent_cli/agents/transcribe.py
#   agent_cli/config.py
#   pyproject.toml
Add speaker diarization support using pyannote-audio:
- Sentence-based alignment (default): fast, splits on punctuation
- Word-level alignment (--align-words): uses wav2vec2 for precise timestamps

New options: --diarize, --diarize-format, --hf-token, --min-speakers, --max-speakers, --align-words, --align-language
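A rough sketch of the "splits on punctuation" step in the sentence-based path (assumed behavior; the branch's `_split_into_sentences` also handles abbreviations and initialisms):

```python
import re


# Naive punctuation split: "Hi there. How are you?" -> ["Hi there.", "How are you?"]
def split_into_sentences(text: str) -> list[str]:
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]
```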
…yannote API
- Move diarization imports to module level in transcribe.py per CLAUDE.md rules
- Remove defensive hasattr check for pyannote API (pyannote>=3.3 always uses DiarizeOutput)
- Update test mocks to use speaker_diarization attribute
- Move diarization processing logic into _apply_diarization() to reduce complexity in _async_main and improve readability
- Fix type aliases with TypeAlias annotation for mypy compatibility
…tests
- Add Literal["inline", "json"] type for diarize_format in config.py to enable CLI validation of format options
- Add comprehensive test suite for alignment.py (20 tests) covering:
  - AlignedWord dataclass
  - ALIGN_MODELS configuration
  - Token conversion and CTC path merging functions
  - Full alignment pipeline with mocked torchaudio
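For context on the Literal annotation above, a sketch of the validation it enables (names here are illustrative; the real alias lives in agent_cli/config.py):

```python
from typing import Literal, TypeAlias, cast

DiarizeFormat: TypeAlias = Literal["inline", "json"]


def validate_diarize_format(value: str) -> DiarizeFormat:
    # Reject anything other than the two documented formats at CLI-parse time.
    if value not in ("inline", "json"):
        msg = f"--diarize-format must be 'inline' or 'json', got {value!r}"
        raise ValueError(msg)
    return cast("DiarizeFormat", value)
```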
…t-cli into feat/speaker-diarization
* feat(dev): add --force flag to `dev clean` command

  Worktrees with modified or untracked files fail to remove with the default `git worktree remove`. Pass --force/-f to force removal.

* Update auto-generated docs

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Match WhisperX's alignment algorithm: beam search (width=5) instead of greedy backtracking, wildcard token (-1) for unknown characters, and wildcard emission scoring (max non-blank probability). Add comprehensive tests for alignment and diarization functions.
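Not WhisperX's code, but a small sketch of the wildcard emission rule described here, assuming `emission` is a `(num_frames, vocab_size)` matrix of per-frame log-probabilities with blank at index 0:

```python
import torch


def emission_score(emission: torch.Tensor, frame: int, token: int, blank_id: int = 0) -> torch.Tensor:
    # Characters missing from the wav2vec2 vocabulary are encoded as the
    # wildcard token -1; its score is the best non-blank probability, so the
    # aligner can pass through unknown characters without matching blank.
    if token == -1:
        scores = emission[frame].clone()
        scores[blank_id] = float("-inf")
        return scores.max()
    return emission[frame, token]
```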
- Change DEFAULT_BEAM_WIDTH from 5 to 2 to match WhisperX's actual call
- Remove redundant bounds checks in _backtrack() and simplify loop guard from `if t <= 0 or j <= 0 or j >= len(tokens)` to `if t <= 0:` to match WhisperX's backtrack_beam
- Add wav2vec2 minimum input padding (400 samples) to prevent crashes on very short audio clips
- Remove unused _get_dominant_speaker() (dead code, only tested, never called from production)
- Add deterministic tests for _merge_repeats and _backtrack, plus cursor test for _get_dominant_speaker_window
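The padding guard, sketched (the 400-sample figure comes from this commit; the exact placement in the branch is not shown here):

```python
import torch

MIN_INPUT_SAMPLES = 400  # wav2vec2's feature extractor needs at least this many samples


def pad_for_wav2vec2(waveform: torch.Tensor) -> torch.Tensor:
    # Right-pad very short clips with zeros so the model does not crash on them.
    n_samples = waveform.shape[-1]
    if n_samples < MIN_INPUT_SAMPLES:
        waveform = torch.nn.functional.pad(waveform, (0, MIN_INPUT_SAMPLES - n_samples))
    return waveform
```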
- Add explicit empty-path fallback in align() when beam search fails
- Simplify _split_words to use str.split() directly
- Remove _get_dominant_speaker_and_bounds, reuse _get_dominant_speaker_window in align_transcript_with_speakers with sentence timing as segment bounds
- Add tests for _split_into_sentences (abbreviations, initialisms, edge cases)
- Add test for empty backtrack result triggering fallback alignment
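A sketch of the overlap rule behind a dominant-speaker-in-window helper as used here, with sentence timing as the window (names and tie-breaking are assumptions):

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SpeakerTurn:
    speaker: str
    start: float  # seconds
    end: float


def dominant_speaker(turns: list[SpeakerTurn], start: float, end: float) -> str | None:
    # Sum each speaker's overlap with the [start, end] window and pick the largest.
    overlap: dict[str, float] = {}
    for turn in turns:
        dur = min(end, turn.end) - max(start, turn.start)
        if dur > 0:
            overlap[turn.speaker] = overlap.get(turn.speaker, 0.0) + dur
    return max(overlap, key=overlap.__getitem__) if overlap else None
```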
Cache _get_blank_id() result to avoid duplicate calls, inline trivial _split_words() wrapper, and document beam_width=2 choice. Add tests for _fill_missing_word_bounds edge cases, deterministic end-to-end CTC pipeline verification, and the align_transcript_with_words fallback path.
Use original waveform length (not padded) for duration computation, matching WhisperX behavior where duration is unaffected by wav2vec2 minimum input padding. Also add direct tests for _fallback_word_alignment and the padding code path, and remove a test that only checked dataclass field assignment.
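In code terms, the rule amounts to something like this (a sketch, not the branch's exact helper):

```python
def frame_to_seconds(frame_index: int, num_frames: int, original_num_samples: int, sample_rate: int = 16_000) -> float:
    # Duration is derived from the original (unpadded) sample count, so the
    # wav2vec2 minimum-input padding never stretches word timestamps.
    duration = original_num_samples / sample_rate
    return frame_index * duration / num_frames
```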
Summary
`pip install agent-cli[diarization]`

New CLI Options
- `--diarize` / `--no-diarize`
- `--diarize-format`: `inline` (default) or `json`
- `--hf-token`
- `--min-speakers`
- `--max-speakers`

Output Formats
Inline (default):
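```
[SPEAKER_00]: Hello, how are you?
```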
JSON:
{ "segments": [ {"speaker": "SPEAKER_00", "start": 0.0, "end": 2.5, "text": "Hello, how are you?"} ] }Usage Examples
Test plan
- `DiarizedSegment` dataclass
- `align_transcript_with_speakers` function
- `format_diarized_output` (inline and JSON)
- `SpeakerDiarizer` class with mocked pyannote