
Conversation

@ai-cora ai-cora commented Nov 24, 2025

Summary

Status: Work in Progress / Draft

Implements cassette-deck-style voice control commands (pause, resume, play, send, stop) for hands-free conversation management with whisper-stream.

Use Cases

  1. Let Me Finish: Disable VAD silence detection to speak at length without interruption
  2. Privacy Pause: Temporarily stop recording for phone calls or sensitive conversations
  3. Transcription Review: Review what was transcribed before sending to LLM
  4. Hands-Free Control: Manage entire conversation flow using voice commands

Architecture

State Machine

┌─────────────┐
│  RECORDING  │ ←─────┐
└──────┬──────┘       │
       │              │
       │ "pause"      │ "resume"
       │              │
       ▼              │
┌─────────────┐       │
│   PAUSED    │───────┘
└──────┬──────┘
       │
       │ "send", "stop", "play"
       │
       ▼
    [RETURN]

  • RECORDING: Actively capturing and transcribing audio
  • PAUSED: Not recording audio, but whisper-stream still running
  • Terminal commands: "send", "stop", "play" return control to caller
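A minimal sketch of this state machine (names such as `CaptureState` and `next_state` are illustrative, not the module's actual API):

```python
from enum import Enum, auto

class CaptureState(Enum):
    RECORDING = auto()
    PAUSED = auto()

# Commands that return control to the caller.
TERMINAL_COMMANDS = {"send", "stop", "play"}

def next_state(state, command):
    """Apply a control command; return the new state, or None if terminal."""
    if command in TERMINAL_COMMANDS:
        return None  # control returns to the caller
    if state is CaptureState.RECORDING and command == "pause":
        return CaptureState.PAUSED
    if state is CaptureState.PAUSED and command == "resume":
        return CaptureState.RECORDING
    return state  # redundant or unrecognized commands leave the state unchanged
```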

Implementation Progress (19 commits)

Core Infrastructure

  • Add SDL2 dependencies and stream-capture specification
  • Add stream_capture module with control phrase detection
  • Integrate stream_capture with converse tool
  • Add --stream-mode flag to converse CLI command

Control Flow

  • Implement pause/resume state machine
  • Strip all control phrases from transcription
  • Use word boundary matching for control phrase detection
  • Show transcription and command logging at INFO level

Deduplication & Quality

  • Add detailed logging for deduplication process
  • Add --keep 0 and --length parameters to reduce duplicates
  • Skip stale segments after resume to prevent paused content leakage
  • Track and log pause/resume timing with relative timestamps

Debug & Testing

  • Add debug output file option to capture raw whisper-stream data
  • Add timing-based whisper output processor (WIP)
  • Add audio feedback for control commands
  • Add CLI test script for stream_capture
  • Add testing guide for stream mode implementation

Bug Fixes

  • Use binary mode and manual decode for asyncio subprocess
  • Remove unsupported bufsize parameter from asyncio subprocess
  • Strip punctuation from words for control phrase matching

What's Working

  • Basic pause/resume flow control
  • Control phrase detection (send, pause, resume, stop)
  • Deduplication of overlapping whisper segments
  • CLI interface with --stream-mode flag
  • Debug logging and output capture
  • Audio feedback for commands

Known Issues / TODO

  • Timing-based processor marked as WIP
  • Playback feature not fully implemented
  • Need comprehensive integration testing
  • Performance optimization for long sessions
  • Documentation needs completion

Files Changed

  • voice_mode/stream_capture.py: New module for stream capture
  • voice_mode/tools/converse.py: Integration with converse tool
  • voice_mode/cli.py: CLI support for --stream-mode
  • docs/stream-capture-spec.md: Complete specification
  • installer/voicemode_install/dependencies.yaml: SDL2 dependencies
  • Test scripts and documentation

Test Plan

  • Basic CLI test script created
  • Testing guide documented
  • Integration tests with converse tool
  • Edge case testing (rapid state changes, long pauses)
  • Real-world usage testing

Related

  • Parent epic: VM-179 (Conversation flow control)
  • Builds on: feature/let-me-finish branch

Note: This is a draft PR to document work in progress. Not ready for review or merge yet.

🤖 Generated with Claude Code

mbailey and others added 19 commits November 25, 2025 01:14
Add SDL2 library requirements for whisper-stream binary:
- Debian/Ubuntu: libsdl2-dev
- Fedora: SDL2-devel
- macOS: sdl2 (via Homebrew)

Add comprehensive specification for stream-capture with cassette-deck
flow controls (record, pause, resume, play, send, stop) to enable
hands-free voice conversation management.

Related to VM-194: Implement stream-capture with cassette-deck flow controls
Parent epic: VM-179 Conversation flow control

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Implements core stream capture functionality using whisper-stream:
- Real-time audio capture with whisper-stream subprocess
- Control phrase detection (send, pause, resume, play, stop)
- Segment deduplication from overlapping whisper output
- Returns structured dict with text, control signal, and metadata

Based on working Saturday audio intelligence code with additions from
let-me-finish branch. MVP focuses on "send" command for Phase 1.

Features:
- VAD mode (step 0) to avoid duplicate segments
- Async subprocess management with proper cleanup
- Configurable control phrases
- Debug logging for troubleshooting

Related to VM-194: Implement stream-capture with cassette-deck flow controls
Parent epic: VM-179 Conversation flow control

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Add simple command-line test script to verify stream_capture works:
- Checks whisper-stream availability
- Runs stream_capture with configurable max duration
- Shows control phrases and captures audio
- Displays results including control signal detected

Usage:
    python test_stream_capture.py [--max-duration SECONDS]

This allows testing the stream_capture module independently before
integrating with the converse tool.

Related to VM-194

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Add stream_mode parameter to converse tool that enables whisper-stream
based capture with control phrase detection:

- New stream_mode parameter (default: false) for backward compatibility
- Validates whisper-stream availability when stream_mode enabled
- Uses stream_capture() instead of VAD recording when stream_mode=true
- Skips separate STT processing (whisper-stream does it during capture)
- Returns transcription with control signal detection

Usage:
    converse "Hello" stream_mode=true

User can speak and say control phrases:
- "send", "i'm done", "go ahead" - submit text
- "pause", "hold on" - pause recording
- "resume", "continue" - resume recording
- "play back", "repeat" - review transcription
- "stop", "cancel" - discard recording

MVP Phase 1: Implements basic send command detection.
Future phases will add pause/resume/play/stop handling.

Related to VM-194: Implement stream-capture with cassette-deck flow controls
Parent epic: VM-179 Conversation flow control

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Add --stream-mode flag to voicemode converse command to enable
whisper-stream based capture with control phrase detection:

- New --stream-mode flag in Click command decorator
- Pass stream_mode parameter to all converse_fn.fn calls
- Works in both single and continuous conversation modes

Usage:
    voicemode converse --stream-mode
    voicemode converse --stream-mode --continuous

When stream mode is enabled, users can speak with flow control:
- Say "send" or "i'm done" to submit text
- Say "pause" to pause recording
- Say "resume" to continue recording
- Say "play back" to review transcription
- Say "stop" to cancel

Related to VM-194: Implement stream-capture with cassette-deck flow controls

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Add comprehensive testing guide covering:
- Three testing options (standalone, CLI, MCP)
- Expected behavior and control phrases
- Known limitations and next phase plans
- Git status and commit history

This provides clear instructions for validating the Phase 1 MVP
implementation of stream-capture with control phrase detection.

Related to VM-194

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
asyncio.create_subprocess_exec doesn't support the bufsize parameter
that's available in subprocess.Popen. Remove it to fix "bufsize must be 0"
error when starting stream_capture.

The text=True parameter already provides line-based reading which is
sufficient for our needs.

Fixes runtime error in stream_capture module.

Related to VM-194

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
asyncio.create_subprocess_exec with PIPE requires text=False.
Changed to read bytes and decode manually to fix "text must be False" error.

This is the correct pattern for asyncio subprocess communication
when using PIPE for stdout/stderr.

Fixes runtime error in stream_capture module.

Related to VM-194

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
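The bytes-then-decode pattern this commit describes might look like the following sketch (function name hypothetical; asyncio subprocess pipes accept neither `text=True` nor `bufsize`):

```python
import asyncio

async def read_stream_lines(cmd):
    """Launch a subprocess and return its decoded stdout lines.

    asyncio subprocess pipes are bytes-only, so read raw bytes
    from stdout and decode each line manually.
    """
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.DEVNULL,
    )
    lines = []
    while True:
        raw = await proc.stdout.readline()
        if not raw:  # EOF
            break
        lines.append(raw.decode("utf-8", errors="replace").rstrip("\n"))
    await proc.wait()
    return lines
```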
Add proper state machine for pause/resume flow control:
- Track current_mode (recording or paused)
- Pause command switches to paused mode, stops adding segments
- Resume command switches back to recording mode
- Only segments captured in recording mode are added to output

Strip all control phrases from final text:
- Track all control phrases detected during capture
- Remove each control phrase from final text
- Handles pause, resume, send, and other control words

This fixes two issues:
1. Words spoken between "pause" and "resume" are now excluded
2. Control words themselves are stripped from output

Example:
  Input: "Hello world pause secret stuff resume and send"
  Output: "Hello world and"

Related to VM-194

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
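A simplified sketch combining the two fixes above, pause/resume gating plus control-word stripping (function name hypothetical; the real module tracks segments rather than individual words):

```python
CONTROL_PHRASES = {"pause", "resume", "send", "stop"}

def filter_transcript(words):
    """Drop words spoken while paused and strip control words themselves."""
    kept, recording = [], True
    for word in words:
        lower = word.lower()
        if lower == "pause":
            recording = False          # stop adding words
        elif lower == "resume":
            recording = True           # start adding words again
        elif lower in CONTROL_PHRASES:
            pass                       # other control words are stripped
        elif recording:
            kept.append(word)
    return " ".join(kept)
```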
Fix false positive matches where control words appear as substrings:
- "pause" was matching in "unpause", "regime", "unfinished"
- Now uses word boundary detection
- Single-word phrases: split text and match exact words
- Multi-word phrases: match complete phrase

This fixes the issue where saying "unpause" or "resume" was being
detected as "pause" instead of "resume".

Example fixes:
- "unpause" -> now correctly detects as "resume" (unpause)
- "unfinished" -> no longer falsely detected as "pause"
- "regime" -> no longer falsely detected as "pause"

Related to VM-194

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
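One way to implement the word-boundary matching described above is a `\b`-anchored regex, which handles single- and multi-word phrases uniformly (illustrative sketch, not the module's actual code):

```python
import re

def detect_control_phrase(text, phrases):
    """Return the first control phrase found as a whole word/phrase, or None.

    Word-boundary anchors prevent false positives such as "pause"
    matching inside "unpause" or "unfinished".
    """
    lowered = text.lower()
    for phrase in phrases:
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
            return phrase
    return None
```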
Change transcription logging from DEBUG to INFO for better visibility:
- Show whisper-stream command being launched
- Display each transcribed segment as it arrives
- Show when segments are ignored (paused mode)

Visual indicators:
- 📝 Active transcription (being recorded)
- ⏸️  Paused transcription (being ignored)

This provides real-time feedback during stream capture showing
what's being captured vs. ignored during pause periods.

Related to VM-194

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Fix control phrase detection to ignore punctuation:
- "Pause." now correctly matches "pause"
- "send!" matches "send"
- "resume," matches "resume"

Without this, control words with punctuation weren't being detected
as control signals, causing them to be added to the transcription
instead of triggering state changes.

This also addresses the duplication issue visible in real-time output
since duplicates are only deduplicated at the end. The deduplication
function works correctly but operates on final output.

Related to VM-194

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
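The punctuation fix can be as small as normalizing each word before matching, e.g. (hypothetical helper name):

```python
import string

def normalize_word(word):
    """Strip surrounding punctuation and lowercase, for control-phrase matching."""
    return word.strip(string.punctuation).lower()
```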
Add whisper-stream parameters from let-me-finish branch to reduce
overlapping segments:
- --keep 0: Don't keep audio from previous chunks
- --length 30000: Max 30 seconds per chunk

This should reduce (though not eliminate) duplicate segments that
appear during real-time capture. Final deduplication still applied
at the end.

The -t 6 (6 threads) is standard for whisper-stream performance.

Related to VM-194

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Add INFO and DEBUG logging to deduplication to show:
- Number of input segments
- After each deduplication pass
- Which segments were removed as substrings
- Word overlaps that were merged
- Final segment count and word count

This helps diagnose why duplicates appear in final output and
verify the deduplication logic is working correctly.

Related to VM-194

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
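The substring-removal pass being logged here might be sketched as follows (illustrative; the real deduplication also merges word overlaps across segment boundaries):

```python
def dedupe_segments(segments):
    """Remove segments fully contained in another segment.

    Exact duplicates keep their first occurrence; a segment is dropped
    when some longer (or earlier identical) segment contains it.
    """
    kept = []
    for i, seg in enumerate(segments):
        contained = any(
            seg in other and (len(other) > len(seg) or j < i)
            for j, other in enumerate(segments) if j != i
        )
        if not contained:
            kept.append(seg)
    return kept
```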
When transitioning from PAUSED -> RECORDING, whisper-stream may output
delayed refinements of audio from the paused period. These segments
appear AFTER the resume command but contain content from BEFORE.

Solution: Skip the next 3 segments after resume to discard stale
whisper-stream refinements.

This fixes the issue where paused content like "Not get included in
the output" was appearing in the final transcription after resume.

Visual indicators:
- ⏭️  [skipped post-resume] - Discarding stale refinement

Related to VM-194

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
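The skip-after-resume heuristic could be sketched like this (names hypothetical; `events` stands in for the stream of whisper output and state changes):

```python
def filter_post_resume(events, skip_count=3):
    """Drop the first `skip_count` segments after each resume.

    `events` is a sequence of ("segment", text) and ("resume", None)
    tuples; whisper-stream may emit late refinements of paused audio
    just after resume, so those segments are discarded.
    """
    kept, to_skip = [], 0
    for kind, payload in events:
        if kind == "resume":
            to_skip = skip_count          # start discarding stale refinements
        elif kind == "segment":
            if to_skip > 0:
                to_skip -= 1              # stale refinement, skip it
            else:
                kept.append(payload)
    return kept
```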
Record timing information for all state changes:
- Relative time since recording started (seconds)
- Whisper t0 timestamp (milliseconds from whisper-stream)
- Log summary of all state changes at end of capture

This provides visibility into when pause/resume occurred during
the recording session, similar to the Saturday audio intelligence
code that tracked segment timing.

Output example:
  State changes during capture:
    pause: 15.3s (whisper t0: 15234ms)
    resume: 42.7s (whisper t0: 42651ms)

Related to VM-194

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Add debug_output_file parameter to stream_capture that uses whisper-stream's
-f flag to save complete raw output to a file for analysis.

When stream_mode is enabled in converse, automatically saves to:
~/tasks/projects/voicemode/VM-194_.../test-data/capture_TIMESTAMP.txt

This captures:
- All whisper-stream output including refinements
- START/END markers with t0/t1 timestamps
- Raw transcription lines
- Timing information

Use this data to develop and test the timing-based filtering algorithm
that will properly handle pause/resume with t0 timestamp filtering.

Related to VM-194

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Add infrastructure for processing raw whisper-stream output using
timestamp-based filtering instead of heuristics:

- WhisperSegment dataclass with start/end times and text
- parse_whisper_timestamp: Convert HH:MM:SS.mmm to seconds
- parse_whisper_line: Extract segments from whisper output format
- process_whisper_output: Filter segments using pause/resume ranges

Algorithm (partial implementation):
1. Parse whisper lines into timestamped segments
2. Build paused time ranges from state_changes
3. Separate t=0 (full retranscriptions) from incremental segments
4. Take longest t=0 segment as base
5. Filter incremental segments by pause ranges

TODO: Complete filtering logic and integrate with stream_capture

Related to VM-194

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
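The two parsing helpers might look roughly like this; the bracketed `[t0 --> t1]` line format is an assumption based on whisper.cpp's timestamped output, and the actual module may differ:

```python
import re

LINE_RE = re.compile(
    r"\[(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})\]\s*(.*)"
)

def parse_whisper_timestamp(stamp):
    """Convert an HH:MM:SS.mmm timestamp into seconds as a float."""
    hours, minutes, seconds = stamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)

def parse_whisper_line(line):
    """Parse one '[t0 --> t1]  text' line into (start, end, text), or None."""
    m = LINE_RE.match(line.strip())
    if not m:
        return None
    return (parse_whisper_timestamp(m.group(1)),
            parse_whisper_timestamp(m.group(2)),
            m.group(3).strip())
```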
Play immediate audio feedback when control phrases detected:
- Pause: Descending tone (chime_end)
- Resume: Ascending tone (chime_start)
- Send/Stop: Double beep

This gives users instant confirmation that their command was recognized,
even though actual segment filtering happens at end of capture.

Critical for UX - users need to know "pause" worked before speaking
sensitive information.

Also add test_processor.py script and initial process_whisper_output
function with timestamp parsing (WIP - filtering not yet complete).

Related to VM-194

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@ai-cora ai-cora requested a review from mbailey November 24, 2025 14:34