feat(transcribe): add speaker diarization support #188

Open
basnijholt wants to merge 34 commits into main from feat/speaker-diarization

@basnijholt
Owner

Summary

  • Add speaker diarization as a post-processing step for transcription using pyannote-audio
  • Identify and label the different speakers in the transcript (useful for meetings, interviews, and other multi-speaker audio)
  • Works with any ASR provider (Wyoming, OpenAI, Gemini)
  • New optional dependency: pip install agent-cli[diarization]

New CLI Options

| Option | Description |
| --- | --- |
| --diarize/--no-diarize | Enable speaker diarization |
| --diarize-format | Output format: inline (default) or json |
| --hf-token | HuggingFace token for pyannote models (required) |
| --min-speakers | Hint: minimum number of speakers |
| --max-speakers | Hint: maximum number of speakers |

Output Formats

Inline (default):

[SPEAKER_00]: Hello, how are you?
[SPEAKER_01]: I'm doing well, thanks!

JSON:

{
  "segments": [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 2.5, "text": "Hello, how are you?"}
  ]
}
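
As a rough illustration of how the two formats relate, here is a minimal sketch that renders the same segments both ways. `DiarizedSegment` is the dataclass named in the test plan, but its fields are assumed here from the JSON example above, and `format_inline` / `format_json` are hypothetical helper names rather than the PR's actual `format_diarized_output` signature.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class DiarizedSegment:
    """One speaker turn; fields are assumed from the JSON example."""

    speaker: str
    start: float  # seconds
    end: float  # seconds
    text: str


def format_inline(segments):
    """Render one "[SPEAKER_XX]: text" line per segment (hypothetical helper)."""
    return "\n".join(f"[{s.speaker}]: {s.text}" for s in segments)


def format_json(segments):
    """Render segments as a JSON object with a "segments" list (hypothetical helper)."""
    return json.dumps({"segments": [asdict(s) for s in segments]}, indent=2)


segments = [
    DiarizedSegment("SPEAKER_00", 0.0, 2.5, "Hello, how are you?"),
    DiarizedSegment("SPEAKER_01", 2.5, 4.0, "I'm doing well, thanks!"),
]
print(format_inline(segments))
print(format_json(segments))
```

The inline form drops the timestamps, so the JSON form is likely the better choice when the output feeds another tool.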

Usage Examples

# Install diarization extra
pip install agent-cli[diarization]

# Basic diarization
agent-cli transcribe --diarize --hf-token YOUR_HF_TOKEN

# Diarize a meeting recording with an expected 2 to 4 speakers
agent-cli transcribe --from-file meeting.wav --diarize --min-speakers 2 --max-speakers 4 --hf-token YOUR_TOKEN

Test plan

  • Unit tests for DiarizedSegment dataclass
  • Unit tests for align_transcript_with_speakers function
  • Unit tests for format_diarized_output (inline and JSON)
  • Unit tests for SpeakerDiarizer class with mocked pyannote
  • Updated existing transcribe recovery tests with new parameters
  • All 513 tests passing
  • Pre-commit hooks passing

basnijholt and others added 30 commits January 10, 2026 07:12
Add speaker diarization as a post-processing step for transcription using
pyannote-audio. This identifies and labels different speakers in the
transcript, useful for meetings, interviews, or multi-speaker audio.

Features:
- New `--diarize` flag to enable speaker diarization
- `--diarize-format` option for inline (default) or JSON output
- `--hf-token` for HuggingFace authentication (required for pyannote models)
- `--min-speakers` and `--max-speakers` hints for improved accuracy
- Works with any ASR provider (Wyoming, OpenAI, Gemini)
- New optional dependency: `pip install agent-cli[diarization]`

Output formats:
- Inline: `[SPEAKER_00]: Hello, how are you?`
- JSON: structured with speaker, timestamps, and text
# Conflicts:
#	agent_cli/agents/transcribe.py
#	agent_cli/config.py
#	pyproject.toml
Add speaker diarization support using pyannote-audio:
- Sentence-based alignment (default): fast, splits on punctuation
- Word-level alignment (--align-words): uses wav2vec2 for precise timestamps

New options: --diarize, --diarize-format, --hf-token, --min-speakers,
--max-speakers, --align-words, --align-language
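
To illustrate the sentence-based mode, here is a minimal sketch, not the PR's implementation. It assumes speaker turns arrive as `(speaker, start, end)` tuples and, because the ASR text carries no timestamps in this mode, it estimates each sentence's time window in proportion to its character count. That estimate and all function names are assumptions made for the example.

```python
import re


def split_sentences(text):
    """Naive split after ., ! or ? followed by whitespace (no abbreviation handling)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def dominant_speaker(turns, start, end):
    """Return the speaker whose turns overlap [start, end] the most, or None."""
    totals = {}
    for speaker, t0, t1 in turns:
        overlap = min(end, t1) - max(start, t0)
        if overlap > 0:
            totals[speaker] = totals.get(speaker, 0.0) + overlap
    return max(totals, key=totals.get) if totals else None


def align_sentences(text, turns, duration):
    """Assign each sentence the dominant speaker of its estimated time window."""
    sentences = split_sentences(text)
    total_chars = sum(len(s) for s in sentences)
    aligned, cursor = [], 0.0
    for sentence in sentences:
        # Assumption: a sentence's share of the audio matches its share of characters.
        span = duration * len(sentence) / total_chars
        aligned.append((dominant_speaker(turns, cursor, cursor + span), sentence))
        cursor += span
    return aligned


turns = [("SPEAKER_00", 0.0, 2.0), ("SPEAKER_01", 2.0, 4.0)]
print(align_sentences("Hello there. I am fine.", turns, duration=4.0))
```

The proportional estimate is cheap but drifts when speakers talk at different rates, which is the kind of error word-level alignment is meant to reduce.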
…yannote API

- Move diarization imports to module level in transcribe.py per CLAUDE.md rules
- Remove defensive hasattr check for pyannote API (pyannote>=3.3 always uses DiarizeOutput)
- Update test mocks to use speaker_diarization attribute
- Move diarization processing logic into _apply_diarization() to reduce
  complexity in _async_main and improve readability
- Fix type aliases with TypeAlias annotation for mypy compatibility
…tests

- Add Literal["inline", "json"] type for diarize_format in config.py
  to enable CLI validation of format options
- Add comprehensive test suite for alignment.py (20 tests) covering:
  - AlignedWord dataclass
  - ALIGN_MODELS configuration
  - Token conversion and CTC path merging functions
  - Full alignment pipeline with mocked torchaudio
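
For readers unfamiliar with the pattern, a `Literal` type makes the set of allowed values available at runtime via `typing.get_args`. The sketch below shows a manual check built on that. How the project's CLI layer actually performs the validation is not shown in this PR text, and `parse_diarize_format` is a hypothetical name.

```python
from typing import Literal, get_args

DiarizeFormat = Literal["inline", "json"]


def parse_diarize_format(value: str) -> DiarizeFormat:
    """Reject any value outside the Literal's allowed set (hypothetical helper)."""
    allowed = get_args(DiarizeFormat)  # ("inline", "json")
    if value not in allowed:
        raise ValueError(f"--diarize-format must be one of {allowed}, got {value!r}")
    return value  # type: ignore[return-value]


print(parse_diarize_format("json"))
```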
* feat(dev): add --force flag to `dev clean` command

Worktrees with modified or untracked files fail to remove with the
default `git worktree remove`. Pass --force/-f to force removal.

* Update auto-generated docs

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Match WhisperX's alignment algorithm: beam search (width=5) instead of
greedy backtracking, wildcard token (-1) for unknown characters, and
wildcard emission scoring (max non-blank probability). Add comprehensive
tests for alignment and diarization functions.
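
The wildcard scoring described above can be sketched for a single frame as follows. This is a simplified pure-Python illustration that assumes emissions are per-frame log-probabilities indexed by vocabulary id and that the blank id is 0. The function name is hypothetical, and the real code presumably operates on tensors.

```python
WILDCARD = -1  # token id used for characters missing from the model vocabulary


def token_scores(frame_emission, tokens, blank_id=0):
    """Score each transcript token for one frame.

    Regular tokens take their own emission score. Wildcard tokens take the
    best score among all non-blank vocabulary entries in that frame.
    """
    best_non_blank = max(
        score for index, score in enumerate(frame_emission) if index != blank_id
    )
    return [
        best_non_blank if token == WILDCARD else frame_emission[token]
        for token in tokens
    ]


frame = [-0.1, -2.0, -3.0, -1.5]  # assumed log-probs; index 0 is blank
print(token_scores(frame, [1, WILDCARD, 3]))
```

Excluding the blank entry matters: blank usually has the highest score in most frames, so including it would let wildcards absorb silence.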
- Change DEFAULT_BEAM_WIDTH from 5 to 2 to match WhisperX's actual call
- Remove redundant bounds checks in _backtrack() and simplify loop guard
  from `if t <= 0 or j <= 0 or j >= len(tokens)` to `if t <= 0:` to
  match WhisperX's backtrack_beam
- Add wav2vec2 minimum input padding (400 samples) to prevent crashes on
  very short audio clips
- Remove unused _get_dominant_speaker() (dead code, only tested, never
  called from production)
- Add deterministic tests for _merge_repeats and _backtrack, plus cursor
  test for _get_dominant_speaker_window
- Add explicit empty-path fallback in align() when beam search fails
- Simplify _split_words to use str.split() directly
- Remove _get_dominant_speaker_and_bounds, reuse _get_dominant_speaker_window
  in align_transcript_with_speakers with sentence timing as segment bounds
- Add tests for _split_into_sentences (abbreviations, initialisms, edge cases)
- Add test for empty backtrack result triggering fallback alignment
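
As background for the `_merge_repeats` tests, the step that turns a frame-level CTC path into per-character segments can be sketched like this. It is a guess at the general shape, modeled on the similarly named helper in the torchaudio forced-alignment tutorial. The `Segment` type and the `(token_index, time_index)` path representation are assumptions, not the PR's actual data structures.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Segment:
    label: str
    start: int  # first frame, inclusive
    end: int  # last frame + 1, exclusive


def merge_repeats(path, transcript):
    """Collapse consecutive path points that share a token index.

    `path` is a time-ordered list of (token_index, time_index) pairs.
    """
    segments = []
    for token_index, group in groupby(path, key=lambda point: point[0]):
        frames = [time_index for _, time_index in group]
        segments.append(Segment(transcript[token_index], frames[0], frames[-1] + 1))
    return segments


path = [(0, 0), (0, 1), (1, 2), (2, 3), (2, 4)]
print(merge_repeats(path, "cat"))
```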
Cache _get_blank_id() result to avoid duplicate calls, inline trivial
_split_words() wrapper, and document beam_width=2 choice. Add tests for
_fill_missing_word_bounds edge cases, deterministic end-to-end CTC
pipeline verification, and the align_transcript_with_words fallback path.
Use original waveform length (not padded) for duration computation,
matching WhisperX behavior where duration is unaffected by wav2vec2
minimum input padding. Also add direct tests for _fallback_word_alignment
and the padding code path, and remove a test that only checked dataclass
field assignment.
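
The padding and duration behavior described above can be sketched as follows. This is a minimal illustration on plain Python lists. The 400-sample minimum comes from the commit notes, while the 16 kHz sample rate and both function names are assumptions.

```python
MIN_SAMPLES = 400  # minimum wav2vec2 input length cited in the commit notes
SAMPLE_RATE = 16_000  # assumed sample rate


def pad_for_model(samples):
    """Zero-pad short clips for the model; also return the unpadded length."""
    original_len = len(samples)
    if original_len < MIN_SAMPLES:
        samples = list(samples) + [0.0] * (MIN_SAMPLES - original_len)
    return samples, original_len


def duration_seconds(original_len, sample_rate=SAMPLE_RATE):
    """Compute duration from the original length so padding never inflates it."""
    return original_len / sample_rate


padded, original_len = pad_for_model([0.1] * 160)
print(len(padded), duration_seconds(original_len))
```

Returning the original length alongside the padded samples keeps the two concerns separate: the model sees a valid input, and timestamps stay tied to the real audio.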