feat(transcribe): add speaker diarization support#188
Open
basnijholt wants to merge 34 commits into main from feat/speaker-diarization
Conversation
Add speaker diarization as a post-processing step for transcription using pyannote-audio. This identifies and labels different speakers in the transcript, useful for meetings, interviews, or multi-speaker audio.

Features:
- New `--diarize` flag to enable speaker diarization
- `--diarize-format` option for inline (default) or JSON output
- `--hf-token` for HuggingFace authentication (required for pyannote models)
- `--min-speakers` and `--max-speakers` hints for improved accuracy
- Works with any ASR provider (Wyoming, OpenAI, Gemini)
- New optional dependency: `pip install agent-cli[diarization]`

Output formats:
- Inline: `[SPEAKER_00]: Hello, how are you?`
- JSON: structured with speaker, timestamps, and text
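For orientation, here is a minimal sketch of how the two output formats could hang off the `DiarizedSegment` dataclass and `format_diarized_output` helper listed in the test plan below. The field names and signature are assumptions for illustration, not the code in this PR.

```python
from __future__ import annotations

import json
from dataclasses import dataclass


@dataclass
class DiarizedSegment:
    speaker: str  # e.g. "SPEAKER_00"
    start: float  # seconds
    end: float    # seconds
    text: str


def format_diarized_output(segments: list[DiarizedSegment], fmt: str = "inline") -> str:
    if fmt == "json":
        payload = {
            "segments": [
                {"speaker": s.speaker, "start": s.start, "end": s.end, "text": s.text}
                for s in segments
            ],
        }
        return json.dumps(payload, indent=2)
    # Inline: one "[SPEAKER_XX]: text" line per segment.
    return "\n".join(f"[{s.speaker}]: {s.text}" for s in segments)
```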
# Conflicts:
#   agent_cli/agents/transcribe.py
#   agent_cli/config.py
#   pyproject.toml
Add speaker diarization support using pyannote-audio:
- Sentence-based alignment (default): fast, splits on punctuation
- Word-level alignment (--align-words): uses wav2vec2 for precise timestamps

New options: --diarize, --diarize-format, --hf-token, --min-speakers, --max-speakers, --align-words, --align-language
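A rough sketch of the "splits on punctuation" step in the sentence-based path (assumed behavior; the branch's `_split_into_sentences` also handles abbreviations and initialisms):

```python
import re


# Naive punctuation split: "Hi there. How are you?" -> ["Hi there.", "How are you?"]
def split_into_sentences(text: str) -> list[str]:
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]
```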
…yannote API
- Move diarization imports to module level in transcribe.py per CLAUDE.md rules
- Remove defensive hasattr check for pyannote API (pyannote>=3.3 always uses DiarizeOutput)
- Update test mocks to use speaker_diarization attribute
- Move diarization processing logic into _apply_diarization() to reduce complexity in _async_main and improve readability
- Fix type aliases with TypeAlias annotation for mypy compatibility
…tests
- Add Literal["inline", "json"] type for diarize_format in config.py to enable CLI validation of format options
- Add comprehensive test suite for alignment.py (20 tests) covering:
  - AlignedWord dataclass
  - ALIGN_MODELS configuration
  - Token conversion and CTC path merging functions
  - Full alignment pipeline with mocked torchaudio
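For context on the Literal annotation above, a sketch of the validation it enables (names here are illustrative; the real alias lives in agent_cli/config.py):

```python
from typing import Literal, TypeAlias, cast

DiarizeFormat: TypeAlias = Literal["inline", "json"]


def validate_diarize_format(value: str) -> DiarizeFormat:
    # Reject anything other than the two documented formats at CLI-parse time.
    if value not in ("inline", "json"):
        msg = f"--diarize-format must be 'inline' or 'json', got {value!r}"
        raise ValueError(msg)
    return cast("DiarizeFormat", value)
```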
…t-cli into feat/speaker-diarization
* feat(dev): add --force flag to `dev clean` command

  Worktrees with modified or untracked files fail to remove with the default `git worktree remove`. Pass --force/-f to force removal.

* Update auto-generated docs

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Match WhisperX's alignment algorithm: beam search (width=5) instead of greedy backtracking, wildcard token (-1) for unknown characters, and wildcard emission scoring (max non-blank probability). Add comprehensive tests for alignment and diarization functions.
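Not WhisperX's code, but a small sketch of the wildcard emission rule described here, assuming `emission` is a `(num_frames, vocab_size)` matrix of per-frame log-probabilities with blank at index 0:

```python
import torch


def emission_score(emission: torch.Tensor, frame: int, token: int, blank_id: int = 0) -> torch.Tensor:
    # Characters missing from the wav2vec2 vocabulary are encoded as the
    # wildcard token -1; its score is the best non-blank probability, so the
    # aligner can pass through unknown characters without matching blank.
    if token == -1:
        scores = emission[frame].clone()
        scores[blank_id] = float("-inf")
        return scores.max()
    return emission[frame, token]
```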
- Change DEFAULT_BEAM_WIDTH from 5 to 2 to match WhisperX's actual call
- Remove redundant bounds checks in _backtrack() and simplify loop guard from `if t <= 0 or j <= 0 or j >= len(tokens)` to `if t <= 0:` to match WhisperX's backtrack_beam
- Add wav2vec2 minimum input padding (400 samples) to prevent crashes on very short audio clips
- Remove unused _get_dominant_speaker() (dead code, only tested, never called from production)
- Add deterministic tests for _merge_repeats and _backtrack, plus cursor test for _get_dominant_speaker_window
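The padding guard, sketched (the 400-sample figure comes from this commit; the exact placement in the branch is not shown here):

```python
import torch

MIN_INPUT_SAMPLES = 400  # wav2vec2's feature extractor needs at least this many samples


def pad_for_wav2vec2(waveform: torch.Tensor) -> torch.Tensor:
    # Right-pad very short clips with zeros so the model does not crash on them.
    n_samples = waveform.shape[-1]
    if n_samples < MIN_INPUT_SAMPLES:
        waveform = torch.nn.functional.pad(waveform, (0, MIN_INPUT_SAMPLES - n_samples))
    return waveform
```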
- Add explicit empty-path fallback in align() when beam search fails
- Simplify _split_words to use str.split() directly
- Remove _get_dominant_speaker_and_bounds, reuse _get_dominant_speaker_window in align_transcript_with_speakers with sentence timing as segment bounds
- Add tests for _split_into_sentences (abbreviations, initialisms, edge cases)
- Add test for empty backtrack result triggering fallback alignment
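A sketch of the overlap rule behind a dominant-speaker-in-window helper as used here, with sentence timing as the window (names and tie-breaking are assumptions):

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SpeakerTurn:
    speaker: str
    start: float  # seconds
    end: float


def dominant_speaker(turns: list[SpeakerTurn], start: float, end: float) -> str | None:
    # Sum each speaker's overlap with the [start, end] window and pick the largest.
    overlap: dict[str, float] = {}
    for turn in turns:
        dur = min(end, turn.end) - max(start, turn.start)
        if dur > 0:
            overlap[turn.speaker] = overlap.get(turn.speaker, 0.0) + dur
    return max(overlap, key=overlap.__getitem__) if overlap else None
```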
Cache _get_blank_id() result to avoid duplicate calls, inline trivial _split_words() wrapper, and document beam_width=2 choice. Add tests for _fill_missing_word_bounds edge cases, deterministic end-to-end CTC pipeline verification, and the align_transcript_with_words fallback path.
Use original waveform length (not padded) for duration computation, matching WhisperX behavior where duration is unaffected by wav2vec2 minimum input padding. Also add direct tests for _fallback_word_alignment and the padding code path, and remove a test that only checked dataclass field assignment.
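In code terms, the rule amounts to something like this (a sketch, not the branch's exact helper):

```python
def frame_to_seconds(frame_index: int, num_frames: int, original_num_samples: int, sample_rate: int = 16_000) -> float:
    # Duration is derived from the original (unpadded) sample count, so the
    # wav2vec2 minimum-input padding never stretches word timestamps.
    duration = original_num_samples / sample_rate
    return frame_index * duration / num_frames
```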
Summary
`pip install agent-cli[diarization]`

New CLI Options
- `--diarize` / `--no-diarize`
- `--diarize-format`: `inline` (default) or `json`
- `--hf-token`
- `--min-speakers`
- `--max-speakers`

Output Formats
Inline (default):
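```
[SPEAKER_00]: Hello, how are you?
```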
JSON:
{ "segments": [ {"speaker": "SPEAKER_00", "start": 0.0, "end": 2.5, "text": "Hello, how are you?"} ] }Usage Examples
Test plan
- `DiarizedSegment` dataclass
- `align_transcript_with_speakers` function
- `format_diarized_output` (inline and JSON)
- `SpeakerDiarizer` class with mocked pyannote