Initial speaker assignment support #1624
📝 Walkthrough

This pull request adds speaker assignment functionality to the transcript editor, refactors shared operations type definitions, implements multi-mode audio channel support in the listener actor pipeline, and extends stream response handling with offset/metadata methods. Changes span desktop transcript UI components, utility functions for segment building and speaker hints, and Rust audio processing actors.
Sequence Diagrams

```mermaid
sequenceDiagram
    participant User as User UI
    participant Editor as TranscriptEditor
    participant Header as SegmentHeader
    participant Menu as ContextMenu
    participant Store as Store
    participant Callback as onAssignSpeaker
    User->>Header: Hover segment (editor mode)
    Header->>Menu: Render Assign Speaker menu
    Menu->>Store: Fetch available humans
    User->>Menu: Select speaker
    Menu->>Header: handleAssignSpeaker(wordIds, humanId)
    Header->>Callback: Call onAssignSpeaker
    Callback->>Store: Create speaker_hints entries
    Callback->>Store: Register checkpoint
```

```mermaid
sequenceDiagram
    participant Source as SourceActor
    participant Listener as ListenerActor
    participant Processor as ProcessorActor
    participant Joiner as Joiner
    participant Stream as StreamResponse
    Source->>Source: Compute new_mode (platform-specific)
    Source->>Processor: ProcMsg::SetMode(new_mode)
    Source->>Listener: ListenerMsg::ChangeMode(new_mode)
    Processor->>Processor: reset_pipeline()
    Listener->>Listener: Spawn new RX task with mode
    Listener->>Joiner: pop_pair(mode)
    alt mode == Single
        Joiner->>Joiner: Mix mic + spk
    else mode == Dual
        Joiner->>Joiner: Return separate mic/spk
    end
    Listener->>Stream: apply_offset()
    Listener->>Stream: set_extra(started_unix_secs)
```
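To make the Single/Dual branch concrete, here is a minimal sketch of the mode-dependent pairing step from the second diagram; `PairOutput` and the additive clamp mix are illustrative assumptions, not the actual Joiner API:

```rust
// Sketch of pop_pair(mode): Single collapses mic + spk into one mixed
// channel; Dual keeps the tracks separate. Types are assumed for
// illustration only.
#[derive(Clone, Copy, PartialEq)]
enum ChannelMode {
    Single,
    Dual,
}

enum PairOutput {
    Mixed(Vec<f32>),
    Separate { mic: Vec<f32>, spk: Vec<f32> },
}

fn pop_pair(mode: ChannelMode, mic: Vec<f32>, spk: Vec<f32>) -> PairOutput {
    match mode {
        // Sum samples and clamp to [-1.0, 1.0] to avoid clipping.
        ChannelMode::Single => PairOutput::Mixed(
            mic.iter()
                .zip(spk.iter())
                .map(|(m, s)| (m + s).clamp(-1.0, 1.0))
                .collect(),
        ),
        // Keep mic and speaker as independent channels.
        ChannelMode::Dual => PairOutput::Separate { mic, spk },
    }
}
```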
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Actionable comments posted: 5
🧹 Nitpick comments (3)
plugins/listener/src/actors/source.rs (2)
233-358: Significant code duplication between use_mixed branches.

The `use_mixed = true` block (lines 236-294) and `use_mixed = false` block (lines 302-356) contain nearly identical logic for creating mic/speaker streams and processing them in a `tokio::select!` loop. The only differences are platform-specific guards and minor variable handling. Consider extracting the common stream processing logic into a helper function to reduce duplication and improve maintainability:
```rust
async fn process_audio_streams(
    mic_device: Option<String>,
    token: CancellationToken,
    stream_cancel_token: CancellationToken,
    mic_muted: Arc<AtomicBool>,
    myself: ActorRef<SourceMsg>,
) {
    let mic_stream = {
        let mut mic_input = AudioInput::from_mic(mic_device).unwrap();
        ResampledAsyncSource::new(mic_input.stream(), SAMPLE_RATE).chunks(AEC_BLOCK_SIZE)
    };

    tokio::time::sleep(tokio::time::Duration::from_millis(50)).await;

    let spk_stream = {
        let mut spk_input = AudioInput::from_speaker();
        ResampledAsyncSource::new(spk_input.stream(), SAMPLE_RATE).chunks(AEC_BLOCK_SIZE)
    };

    // ... common select! loop logic
}
```

Then both branches can call this helper with appropriate cfg guards.
273-279: Consider caching zero buffer for muted mic data.

The muting logic allocates a new `Vec<f32>` filled with zeros for each muted chunk. While correct, this could be optimized by caching a zero buffer (similar to `silence_cache` in processor.rs) or using a pre-allocated static buffer. Example optimization:
```rust
// At SourceState level
zero_buffer: Arc<[f32]>, // pre-allocated zeros matching AEC_BLOCK_SIZE

// In processing:
let output_data = if mic_muted.load(Ordering::Relaxed) {
    zero_buffer.clone()
} else {
    Arc::from(data)
};
```

This would eliminate repeated allocations during muted periods.
apps/desktop/src/utils/speaker-hints.ts (1)
37-56: Consider extracting duplicate JSON parsing logic.

The `user_speaker_assignment` handling is implemented correctly with proper validation. However, the JSON parsing logic (lines 38-46) is duplicated in the `parseProviderSpeakerIndex` implementation (lines 68-76). Consider extracting a shared helper:
```diff
+const parseJsonValue = (raw: unknown): unknown | undefined => {
+  if (raw == null) {
+    return undefined;
+  }
+
+  return typeof raw === "string"
+    ? (() => {
+        try {
+          return JSON.parse(raw);
+        } catch {
+          return undefined;
+        }
+      })()
+    : raw;
+};
+
 export function convertStorageHintsToRuntime(
   storageHints: SpeakerHintStorage[],
   wordIdToIndex: Map<string, number>,
 ): RuntimeSpeakerHint[] {
   const hints: RuntimeSpeakerHint[] = [];
   storageHints.forEach((hint) => {
     // ... existing validation ...
     if (hint.type === "provider_speaker_index") {
       const parsed = parseProviderSpeakerIndex(hint.value);
       // ...
     } else if (hint.type === "user_speaker_assignment") {
-      const data = typeof hint.value === "string"
-        ? (() => {
-            try {
-              return JSON.parse(hint.value);
-            } catch {
-              return undefined;
-            }
-          })()
-        : hint.value;
+      const data = parseJsonValue(hint.value);
       if (data && typeof data === "object" && "human_id" in data && typeof data.human_id === "string") {
         // ...
       }
     }
   });
 }

 const parseProviderSpeakerIndex = (raw: unknown): ProviderSpeakerIndexHint | undefined => {
-  if (raw == null) {
-    return undefined;
-  }
-
-  const data = typeof raw === "string"
-    ? (() => {
-        try {
-          return JSON.parse(raw);
-        } catch {
-          return undefined;
-        }
-      })()
-    : raw;
+  const data = parseJsonValue(raw);
   return providerSpeakerIndexSchema.safeParse(data).data;
 };
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (16)
- apps/desktop/src/components/main/body/sessions/note-input/transcript/editor.tsx (2 hunks)
- apps/desktop/src/components/main/body/sessions/note-input/transcript/index.tsx (1 hunks)
- apps/desktop/src/components/main/body/sessions/note-input/transcript/shared/index.tsx (6 hunks)
- apps/desktop/src/components/main/body/sessions/note-input/transcript/shared/operations.tsx (1 hunks)
- apps/desktop/src/components/main/body/sessions/note-input/transcript/shared/segment-header.tsx (4 hunks)
- apps/desktop/src/devtool/seed/data/curated.json (1 hunks)
- apps/desktop/src/utils/segment.test.ts (1 hunks)
- apps/desktop/src/utils/segment.ts (3 hunks)
- apps/desktop/src/utils/speaker-hints.ts (1 hunks)
- owhisper/owhisper-client/src/lib.rs (1 hunks)
- owhisper/owhisper-interface/src/stream.rs (1 hunks)
- plugins/listener/src/actors/listener.rs (6 hunks)
- plugins/listener/src/actors/mod.rs (1 hunks)
- plugins/listener/src/actors/processor.rs (7 hunks)
- plugins/listener/src/actors/session.rs (6 hunks)
- plugins/listener/src/actors/source.rs (7 hunks)
🧰 Additional context used
🧬 Code graph analysis (12)
apps/desktop/src/components/main/body/sessions/note-input/transcript/index.tsx (1)
- apps/desktop/src/contexts/listener.tsx (1)
  - `useListener` (33-47)

plugins/listener/src/actors/session.rs (2)
- plugins/listener/src/actors/listener.rs (1)
  - `name` (51-53)
- plugins/listener/src/actors/source.rs (1)
  - `name` (48-50)

owhisper/owhisper-interface/src/stream.rs (1)
- plugins/listener/js/bindings.gen.ts (2)
  - `StreamResponse` (98-98)
  - `Extra` (93-93)

apps/desktop/src/components/main/body/sessions/note-input/transcript/editor.tsx (1)
- apps/desktop/src/components/main/body/sessions/note-input/transcript/shared/index.tsx (1)
  - `TranscriptContainer` (25-104)

apps/desktop/src/components/main/body/sessions/note-input/transcript/shared/index.tsx (2)
- apps/desktop/src/components/main/body/sessions/note-input/transcript/shared/operations.tsx (1)
  - `Operations` (1-4)
- apps/desktop/src/components/main/body/sessions/note-input/transcript/shared/segment-header.tsx (1)
  - `SegmentHeader` (18-106)

apps/desktop/src/utils/speaker-hints.ts (1)
- packages/db/src/schema.ts (2)
  - `ProviderSpeakerIndexHint` (287-287)
  - `providerSpeakerIndexSchema` (281-285)

apps/desktop/src/components/main/body/sessions/note-input/transcript/shared/segment-header.tsx (3)
- apps/desktop/src/utils/segment.ts (1)
  - `Segment` (31-34)
- apps/desktop/src/components/main/body/sessions/note-input/transcript/shared/operations.tsx (1)
  - `Operations` (1-4)
- packages/db/src/schema.ts (1)
  - `humans` (61-74)

apps/desktop/src/utils/segment.test.ts (1)
- apps/desktop/src/utils/segment.ts (2)
  - `SegmentKey` (36-40)
  - `SegmentKey` (42-46)

plugins/listener/src/actors/processor.rs (1)
- crates/audio-utils/src/lib.rs (1)
  - `f32_to_i16_bytes` (51-61)

plugins/listener/src/actors/listener.rs (2)
- crates/ws/src/client.rs (1)
  - `finalize_with_text` (23-27)
- owhisper/owhisper-interface/src/stream.rs (4)
  - `default` (60-67)
  - `default` (82-93)
  - `apply_offset` (141-160)
  - `set_extra` (162-166)

apps/desktop/src/utils/segment.ts (2)
- plugins/db/js/bindings.gen.ts (1)
  - `SpeakerIdentity` (200-200)
- packages/db/src/schema.ts (2)
  - `words` (130-141)
  - `speakerHints` (144-154)

plugins/listener/src/actors/source.rs (3)
- crates/audio/src/lib.rs (1)
  - `is_using_headphone` (204-217)
- plugins/listener/src/actors/listener.rs (1)
  - `name` (51-53)
- plugins/listener/src/actors/processor.rs (1)
  - `name` (53-55)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: ci (macos, macos-14)
🔇 Additional comments (20)
plugins/listener/src/actors/source.rs (2)
8-8: LGTM! Clean addition of ChannelMode state and query capability.

The `GetMode` RPC pattern matches the existing `GetMicMute`/`GetMicDevice` patterns, and initialization to `Dual` is appropriate.

Also applies to: 23-23, 42-42, 122-122
161-165: LGTM! GetMode handler follows established patterns.

The implementation correctly returns the current mode via RPC reply, consistent with other getter handlers.
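As a generic illustration of the request/reply shape these getters share (the actual code uses ractor's RpcReplyPort; the tokio oneshot channel below is a stand-in):

```rust
// Request/reply getter pattern sketched with plain tokio primitives.
use tokio::sync::{mpsc, oneshot};

#[derive(Clone, Copy, Debug, PartialEq)]
enum ChannelMode {
    Single,
    Dual,
}

enum SourceMsg {
    GetMode(oneshot::Sender<ChannelMode>),
    SetMode(ChannelMode),
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<SourceMsg>(8);

    // Actor task: owns the mode and answers queries.
    tokio::spawn(async move {
        let mut mode = ChannelMode::Dual; // initialized to Dual, per the review
        while let Some(msg) = rx.recv().await {
            match msg {
                SourceMsg::GetMode(reply) => {
                    let _ = reply.send(mode); // ignore dropped reply ports
                }
                SourceMsg::SetMode(new_mode) => mode = new_mode,
            }
        }
    });

    // Caller side: send the query and await the reply.
    let (reply_tx, reply_rx) = oneshot::channel();
    tx.send(SourceMsg::GetMode(reply_tx)).await.unwrap();
    assert_eq!(reply_rx.await.unwrap(), ChannelMode::Dual);
}
```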
plugins/listener/src/actors/processor.rs (5)
11-11: LGTM! Clean mode-driven architecture additions.

The `SetMode`/`Reset` message variants and `mode` state field establish a clear control flow for switching between Single and Dual processing modes.

Also applies to: 20-21, 36-37, 76-76
39-47: LGTM! Comprehensive pipeline reset.

The `reset_pipeline` method properly clears all stateful components including joiner queues, cached audio, AGC instances, and timing state.
99-107: LGTM! Proper mode change handling with conditional reset.

The `SetMode` handler correctly checks whether the mode has actually changed before resetting the pipeline, avoiding unnecessary disruption.
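A condensed sketch of the conditional-reset flow described in these two comments; field names follow the review text, while the types and internals are assumptions:

```rust
// Mode change with conditional pipeline reset (illustrative shapes only).
use std::collections::VecDeque;

#[derive(Clone, Copy, PartialEq)]
enum ChannelMode {
    Single,
    Dual,
}

struct ProcessorState {
    mode: ChannelMode,
    mic_queue: VecDeque<Vec<f32>>,
    spk_queue: VecDeque<Vec<f32>>,
    last_spk: Option<Vec<f32>>,
    silence_cache: Option<Vec<f32>>, // allocation cache: survives resets
}

impl ProcessorState {
    fn reset_pipeline(&mut self) {
        // Drop all stateful audio so the new mode starts clean; the
        // silence cache is only an optimization, so it is kept.
        self.mic_queue.clear();
        self.spk_queue.clear();
        self.last_spk = None;
    }

    fn set_mode(&mut self, new_mode: ChannelMode) {
        // Reset only on an actual change, avoiding needless disruption.
        if self.mode != new_mode {
            self.mode = new_mode;
            self.reset_pipeline();
        }
    }
}
```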
131-147: LGTM! Mode-dependent audio routing is correctly implemented.

In Single mode, `spk_bytes` contains the mixed (mic + spk) audio, which is appropriate for single-channel transcription. The clamping prevents overflow/clipping.

Note: RecorderActor (lines 118-123) always receives mixed audio regardless of mode, which appears intentional.
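A minimal sketch of that Single-mode mix, assuming f32 samples and a saturating little-endian i16 conversion; the hypothetical `mix_to_bytes` stands in for the mixing loop combined with an `f32_to_i16_bytes`-style helper:

```rust
// Sum mic and speaker samples, clamp to [-1.0, 1.0] to prevent clipping,
// then serialize as i16 little-endian bytes for the transcription stream.
fn mix_to_bytes(mic: &[f32], spk: &[f32]) -> Vec<u8> {
    mic.iter()
        .zip(spk.iter())
        .map(|(m, s)| (m + s).clamp(-1.0, 1.0))
        .flat_map(|sample| ((sample * i16::MAX as f32) as i16).to_le_bytes())
        .collect()
}
```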
195-205: LGTM! Efficient silence caching and queue overflow protection.

The `get_silence` method properly caches zero buffers to avoid repeated allocations, and queue size limits with overflow warnings prevent unbounded memory growth.

Note: `reset()` correctly preserves the `silence_cache` since it's just an optimization cache.

Also applies to: 209-220
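A small sketch of the caching-plus-bounded-queue idea; the queue limit and field shapes are assumptions for illustration:

```rust
// Cached zero buffer plus bounded queue push (illustrative shapes only).
use std::collections::VecDeque;
use std::sync::Arc;

const MAX_QUEUE: usize = 64; // assumed limit, for illustration

struct Queues {
    silence_cache: Option<Arc<[f32]>>,
    spk_queue: VecDeque<Arc<[f32]>>,
}

impl Queues {
    // Return a shared zero buffer, allocating only on first use (or when
    // the requested length changes).
    fn get_silence(&mut self, len: usize) -> Arc<[f32]> {
        match &self.silence_cache {
            Some(buf) if buf.len() == len => buf.clone(),
            _ => {
                let buf: Arc<[f32]> = vec![0.0f32; len].into();
                self.silence_cache = Some(buf.clone());
                buf
            }
        }
    }

    // Push with overflow protection: drop the oldest chunk and warn.
    fn push_spk(&mut self, chunk: Arc<[f32]>) {
        if self.spk_queue.len() >= MAX_QUEUE {
            self.spk_queue.pop_front();
            eprintln!("warning: spk queue overflow, dropping oldest chunk");
        }
        self.spk_queue.push_back(chunk);
    }
}
```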
apps/desktop/src/components/main/body/sessions/note-input/transcript/editor.tsx (2)
4-4: LGTM!

The `id` utility import is correctly added to support hint ID generation.
58-61: LGTM!

The operations object correctly wires both `onDeleteWord` and `onAssignSpeaker` handlers to the TranscriptContainer.

apps/desktop/src/components/main/body/sessions/note-input/transcript/shared/index.tsx (4)
21-21: LGTM!

Correctly imports the centralized `Operations` type to replace the local `WordOperations` definition.
30-30: LGTM!

The refactoring to use the centralized `Operations` type is consistent throughout the component hierarchy (TranscriptContainer → RenderTranscript → SegmentRenderer → WordSpan).

Also applies to: 133-133, 182-182, 243-243
115-115: LGTM!

The decorative separator text change improves the visual distinction between transcript segments.
197-197: LGTM!

Correctly passes the `operations` prop to `SegmentHeader`, enabling the context menu functionality for speaker assignment.

apps/desktop/src/components/main/body/sessions/note-input/transcript/shared/segment-header.tsx (6)
4-16: LGTM!

Correctly imports ContextMenu components, the `Operations` type, and necessary utilities to support the speaker assignment UI.
18-18: LGTM!

The function signature correctly extends `SegmentHeader` to accept an optional `operations` prop, maintaining backward compatibility while enabling editor functionality.
44-59: LGTM!

Good implementation of editor mode detection and speaker assignment handler:
- Properly derives mode based on operations presence
- Safely filters words with IDs before non-null assertion
- Correctly memoizes the assignment handler with appropriate dependencies
78-103: LGTM!

The context menu implementation demonstrates good UX practices:
- Only enabled in editor mode when there are assignable words
- Provides clear hierarchical menu structure (Assign Speaker → human list)
- Gracefully handles empty humans list with a disabled item
- Includes defensive fallback from `human?.name` to `humanId`
108-125: LGTM!

The `useSegmentColor` implementation correctly generates distinct colors per speaker using chroma-js with channel-based palettes and proper memoization.
127-148: LGTM!

The `useSpeakerLabel` hook demonstrates excellent defensive coding:
- Prioritizes human names from the store when available
- Falls back to channel-based labels using the `ChannelProfile` enum
- Provides sensible defaults for all cases (speaker index or channel letter)
apps/desktop/src/utils/speaker-hints.ts (1)
63-79: No breaking changes detected.

The verification confirms that `parseProviderSpeakerIndex` is only used internally within apps/desktop/src/utils/speaker-hints.ts (line 25), with no imports from other files in the codebase. The function is not exported, so the change is safe.
In plugins/listener/src/actors/listener.rs:

```rust
let outbound = tokio_stream::StreamExt::map(
    tokio_stream::wrappers::ReceiverStream::new(rx),
    |msg| match msg {
        MixedMessage::Audio((_mic, spk)) => MixedMessage::Audio(spk),
        MixedMessage::Control(c) => MixedMessage::Control(c),
    },
);
```
Single-channel stream is forwarding the wrong audio track
In the single-mode branch we drop the mic buffers and forward the speaker channel instead. In Single mode the upstream Source only guarantees the first tuple element (mic); the second element is often empty. With the current mapping every single-channel session will stream silence (or the wrong track) to Deepgram, breaking transcription entirely. Please forward the mic data here.
```diff
-        MixedMessage::Audio((_mic, spk)) => MixedMessage::Audio(spk),
+        MixedMessage::Audio((mic, _spk)) => MixedMessage::Audio(mic),
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
let outbound = tokio_stream::StreamExt::map(
    tokio_stream::wrappers::ReceiverStream::new(rx),
    |msg| match msg {
        MixedMessage::Audio((mic, _spk)) => MixedMessage::Audio(mic),
        MixedMessage::Control(c) => MixedMessage::Control(c),
    },
);
```
🤖 Prompt for AI Agents
```
In plugins/listener/src/actors/listener.rs around lines 208 to 214, the
single-channel mapping currently drops the mic buffer and forwards the speaker
buffer, which is wrong because in Single mode the upstream Source only
guarantees the mic (first tuple element); change the mapping so that for
MixedMessage::Audio((mic, _)) you forward MixedMessage::Audio(mic) instead of
the speaker, and keep MixedMessage::Control unchanged; this ensures
single-channel sessions send the mic track to Deepgram.
```
No description provided.