feat: implement stream-capture with cassette-deck flow controls (VM-194) [WIP] #123
Draft: ai-cora wants to merge 19 commits into master from feat/VM-194-implement-stream-capture-with-cassette-deck-flow
Add SDL2 library requirements for whisper-stream binary:
- Debian/Ubuntu: libsdl2-dev
- Fedora: SDL2-devel
- macOS: sdl2 (via Homebrew)
Add comprehensive specification for stream-capture with cassette-deck
flow controls (record, pause, resume, play, send, stop) to enable
hands-free voice conversation management.
Related to VM-194: Implement stream-capture with cassette-deck flow controls
Parent epic: VM-179 Conversation flow control
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Implements core stream capture functionality using whisper-stream:
- Real-time audio capture with whisper-stream subprocess
- Control phrase detection (send, pause, resume, play, stop)
- Segment deduplication from overlapping whisper output
- Returns structured dict with text, control signal, and metadata
Based on working Saturday audio intelligence code with additions from
let-me-finish branch. MVP focuses on "send" command for Phase 1.
Features:
- VAD mode (step 0) to avoid duplicate segments
- Async subprocess management with proper cleanup
- Configurable control phrases
- Debug logging for troubleshooting
Related to VM-194: Implement stream-capture with cassette-deck flow controls
Parent epic: VM-179 Conversation flow control
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Add simple command-line test script to verify stream_capture works:
- Checks whisper-stream availability
- Runs stream_capture with configurable max duration
- Shows control phrases and captures audio
- Displays results including control signal detected
Usage:
python test_stream_capture.py [--max-duration SECONDS]
This allows testing the stream_capture module independently before
integrating with the converse tool.
Related to VM-194
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Add stream_mode parameter to converse tool that enables whisper-stream
based capture with control phrase detection:
- New stream_mode parameter (default: false) for backward compatibility
- Validates whisper-stream availability when stream_mode enabled
- Uses stream_capture() instead of VAD recording when stream_mode=true
- Skips separate STT processing (whisper-stream does it during capture)
- Returns transcription with control signal detection
Usage:
converse "Hello" stream_mode=true
User can speak and say control phrases:
- "send", "i'm done", "go ahead" - submit text
- "pause", "hold on" - pause recording
- "resume", "continue" - resume recording
- "play back", "repeat" - review transcription
- "stop", "cancel" - discard recording
MVP Phase 1: Implements basic send command detection.
Future phases will add pause/resume/play/stop handling.
Related to VM-194: Implement stream-capture with cassette-deck flow controls
Parent epic: VM-179 Conversation flow control
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Add --stream-mode flag to voicemode converse command to enable
whisper-stream based capture with control phrase detection:
- New --stream-mode flag in Click command decorator
- Pass stream_mode parameter to all converse_fn.fn calls
- Works in both single and continuous conversation modes
Usage:
voicemode converse --stream-mode
voicemode converse --stream-mode --continuous
When stream mode is enabled, users can speak with flow control:
- Say "send" or "i'm done" to submit text
- Say "pause" to pause recording
- Say "resume" to continue recording
- Say "play back" to review transcription
- Say "stop" to cancel
Related to VM-194: Implement stream-capture with cassette-deck flow controls
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Add comprehensive testing guide covering:
- Three testing options (standalone, CLI, MCP)
- Expected behavior and control phrases
- Known limitations and next phase plans
- Git status and commit history
This provides clear instructions for validating the Phase 1 MVP
implementation of stream-capture with control phrase detection.
Related to VM-194
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
asyncio.create_subprocess_exec doesn't support the bufsize parameter
that's available in subprocess.Popen. Remove it to fix "bufsize must
be 0" error when starting stream_capture.
The text=True parameter already provides line-based reading which is
sufficient for our needs.
Fixes runtime error in stream_capture module.
Related to VM-194
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
asyncio.create_subprocess_exec with PIPE requires text=False. Changed
to read bytes and decode manually to fix "text must be False" error.
This is the correct pattern for asyncio subprocess communication when
using PIPE for stdout/stderr.
Fixes runtime error in stream_capture module.
Related to VM-194
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
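The bytes-then-decode pattern this commit describes can be sketched as follows. This is a minimal illustration, not the code in stream_capture.py: the helper name `read_lines` is hypothetical, and a plain Python child process stands in for whisper-stream so the sketch runs anywhere.

```python
from __future__ import annotations

import asyncio
import contextlib
import sys


async def read_lines(cmd: list[str]) -> list[str]:
    """Run cmd and return its stdout lines, decoded manually from bytes."""
    # asyncio subprocess pipes carry bytes, so neither text=True nor a
    # non-zero bufsize is passed here.
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.DEVNULL,
    )
    assert proc.stdout is not None
    lines: list[str] = []
    try:
        while True:
            raw = await proc.stdout.readline()  # returns b"" at EOF
            if not raw:
                break
            lines.append(raw.decode("utf-8", errors="replace").rstrip("\r\n"))
    except asyncio.CancelledError:
        # Stop the child if the caller cancels the capture.
        with contextlib.suppress(ProcessLookupError):
            proc.terminate()
        raise
    finally:
        await proc.wait()  # always reap the child process
    return lines


if __name__ == "__main__":
    demo = [sys.executable, "-c", "print('hello'); print('send')"]
    print(asyncio.run(read_lines(demo)))
```

Decoding with `errors="replace"` is a design choice for the sketch: a malformed byte in the child's output degrades one character instead of aborting the capture.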
Add proper state machine for pause/resume flow control:
- Track current_mode (recording or paused)
- Pause command switches to paused mode, stops adding segments
- Resume command switches back to recording mode
- Only segments captured in recording mode are added to output
Strip all control phrases from final text:
- Track all control phrases detected during capture
- Remove each control phrase from final text
- Handles pause, resume, send, and other control words
This fixes two issues:
1. Words spoken between "pause" and "resume" are now excluded
2. Control words themselves are stripped from output
Example:
Input: "Hello world pause secret stuff resume and send"
Output: "Hello world and"
Related to VM-194
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
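The example in this commit message can be reproduced with a small two-state machine. The sketch below is a word-level simplification under stated assumptions: the actual module works on whisper-stream segments, and the function name `apply_flow_controls` and the single-word control set are hypothetical.

```python
from __future__ import annotations

from typing import Optional


def apply_flow_controls(transcript: str) -> tuple[str, Optional[str]]:
    """Return (kept_text, final_signal) for a whitespace-split transcript."""
    mode = "recording"  # the other state is "paused"
    kept: list[str] = []
    signal: Optional[str] = None
    for word in transcript.split():
        token = word.lower().strip(".,!?")
        if token == "pause":
            mode = "paused"  # control word itself is dropped
        elif token == "resume":
            mode = "recording"
        elif token == "send":
            signal = "send"  # terminal command: stop consuming input
            break
        elif mode == "recording":
            kept.append(word)  # words spoken while paused are never kept
    return " ".join(kept), signal


if __name__ == "__main__":
    print(apply_flow_controls("Hello world pause secret stuff resume and send"))
```

Because control words are consumed by the state transitions, they never reach the output, which covers both fixes listed in the commit in one pass.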
Fix false positive matches where control words appear as substrings:
- "pause" was matching in "unpause", "regime", "unfinished"
- Now uses word boundary detection
- Single-word phrases: split text and match exact words
- Multi-word phrases: match complete phrase
This fixes the issue where saying "unpause" or "resume" was being
detected as "pause" instead of "resume".
Example fixes:
- "unpause" -> now correctly detects as "resume" (unpause)
- "unfinished" -> no longer falsely detected as "pause"
- "regime" -> no longer falsely detected as "pause"
Related to VM-194
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
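A sketch of the word-boundary matching described here: exact word comparison for single-word phrases, whole-phrase matching for multi-word ones. The phrase table is assembled from phrases quoted elsewhere in this PR; its exact contents, the name `detect_control`, and the lookup order are assumptions.

```python
from __future__ import annotations

import re
from typing import Optional

# Hypothetical phrase table; the real one is configurable.
CONTROL_PHRASES: dict[str, list[str]] = {
    "send": ["send", "i'm done", "go ahead"],
    "pause": ["pause", "hold on"],
    "resume": ["resume", "continue", "unpause"],
    "play": ["play back", "repeat"],
    "stop": ["stop", "cancel"],
}


def detect_control(text: str) -> Optional[str]:
    """Return the control signal found in text, or None."""
    lowered = text.lower()
    words = lowered.split()
    for signal, phrases in CONTROL_PHRASES.items():
        for phrase in phrases:
            if " " in phrase:
                # Multi-word phrase: must appear complete, on word boundaries.
                if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
                    return signal
            elif phrase in words:
                # Single word: exact match against a whole word, so
                # "pause" no longer fires inside "unpause".
                return signal
    return None
```

With substring matching, `"pause" in "unpause"` is true and the wrong signal wins; comparing against the split word list removes that class of false positive.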
Change transcription logging from DEBUG to INFO for better visibility:
- Show whisper-stream command being launched
- Display each transcribed segment as it arrives
- Show when segments are ignored (paused mode)
Visual indicators:
- 📝 Active transcription (being recorded)
- ⏸️ Paused transcription (being ignored)
This provides real-time feedback during stream capture showing what's
being captured vs. ignored during pause periods.
Related to VM-194
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Fix control phrase detection to ignore punctuation:
- "Pause." now correctly matches "pause"
- "send!" matches "send"
- "resume," matches "resume"
Without this, control words with punctuation weren't being detected as
control signals, causing them to be added to the transcription instead
of triggering state changes.
This also addresses the duplication issue visible in real-time output
since duplicates are only deduplicated at the end. The deduplication
function works correctly but operates on final output.
Related to VM-194
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
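One way to get the punctuation-insensitive matching described above is to normalize text before comparing. The sketch below is an assumption about the approach, not the module's code; `normalize_words` and `is_control_word` are hypothetical names. Apostrophes are kept so that phrases such as "i'm done" still match.

```python
from __future__ import annotations

import string

# Strip every ASCII punctuation mark except the apostrophe.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation.replace("'", ""))


def normalize_words(text: str) -> list[str]:
    """Lowercase text, drop punctuation, and split into words."""
    return text.lower().translate(_PUNCT_TABLE).split()


def is_control_word(text: str, word: str) -> bool:
    """True if word appears as a whole word in text, ignoring punctuation."""
    return word in normalize_words(text)
```

For example, whisper-stream output such as `Pause.` normalizes to `["pause"]`, so it is treated as a control signal instead of being appended to the transcription.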
Add whisper-stream parameters from let-me-finish branch to reduce
overlapping segments:
- --keep 0: Don't keep audio from previous chunks
- --length 30000: Max 30 seconds per chunk
This should reduce (though not eliminate) duplicate segments that
appear during real-time capture. Final deduplication still applied at
the end.
The -t 6 (6 threads) is standard for whisper-stream performance.
Related to VM-194
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
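For reference, the flags named in these commits could be assembled into an argument list as sketched below. The helper names are hypothetical, the binary is assumed to be on PATH as `whisper-stream`, and any model-selection argument the real command passes is omitted because the commit messages do not state it.

```python
from __future__ import annotations

import shutil


def build_whisper_stream_argv(binary: str = "whisper-stream", threads: int = 6) -> list[str]:
    """Build the argv for whisper-stream using the flags named in this PR."""
    return [
        binary,
        "-t", str(threads),   # worker threads
        "--step", "0",        # VAD mode, as described in the initial commit
        "--length", "30000",  # at most 30 seconds per chunk
        "--keep", "0",        # do not carry audio over from the previous chunk
    ]


def whisper_stream_available(binary: str = "whisper-stream") -> bool:
    """Check that the binary can be found on PATH before launching it."""
    return shutil.which(binary) is not None
```

Passing the list to `asyncio.create_subprocess_exec(*argv, ...)` avoids shell quoting issues.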
Add INFO and DEBUG logging to deduplication to show:
- Number of input segments
- After each deduplication pass
- Which segments were removed as substrings
- Word overlaps that were merged
- Final segment count and word count
This helps diagnose why duplicates appear in final output and verify
the deduplication logic is working correctly.
Related to VM-194
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
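The two passes the log lines refer to (dropping segments contained in another segment, then merging word overlaps between neighbours) might look roughly like this. It is a sketch under assumptions: the function name `dedupe_segments`, the case-insensitive comparison, and the pass order are illustrative, and the real function may differ.

```python
from __future__ import annotations

import logging

logger = logging.getLogger("stream_capture.dedupe")


def dedupe_segments(segments: list[str]) -> str:
    """Collapse overlapping transcription segments into one string."""
    cleaned = [s.strip() for s in segments if s.strip()]
    logger.info("dedupe: %d input segments", len(cleaned))

    # Pass 1: drop segments whose words are contained in another segment
    # (for exact duplicates, only the first copy survives).
    kept: list[str] = []
    for i, seg in enumerate(cleaned):
        padded = f" {seg.lower()} "
        contained = any(
            j != i
            and padded in f" {other.lower()} "
            and (len(other) > len(seg) or j < i)
            for j, other in enumerate(cleaned)
        )
        if contained:
            logger.debug("dedupe: removed substring segment %r", seg)
        else:
            kept.append(seg)
    logger.info("dedupe: %d segments after substring pass", len(kept))

    # Pass 2: merge the longest word overlap between the text so far
    # and the start of the next segment.
    words: list[str] = []
    for seg in kept:
        new = seg.split()
        overlap = 0
        for k in range(min(len(words), len(new)), 0, -1):
            if [w.lower() for w in words[-k:]] == [w.lower() for w in new[:k]]:
                overlap = k
                break
        if overlap:
            logger.debug("dedupe: merged %d overlapping words", overlap)
        words.extend(new[overlap:])
    logger.info("dedupe: final word count %d", len(words))
    return " ".join(words)
```

Padding with spaces before the containment check keeps the substring pass on word boundaries, so a short segment like "he" is not swallowed by "the".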
When transitioning from PAUSED -> RECORDING, whisper-stream may output
delayed refinements of audio from the paused period. These segments
appear AFTER the resume command but contain content from BEFORE.
Solution: Skip the next 3 segments after resume to discard stale
whisper-stream refinements.
This fixes the issue where paused content like "Not get included in
the output" was appearing in the final transcription after resume.
Visual indicators:
- ⏭️ [skipped post-resume] - Discarding stale refinement
Related to VM-194
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
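The skip-after-resume heuristic can be illustrated with a counter that is armed on every resume. The sketch assumes each control phrase arrives as its own segment; the names `filter_segments` and `SKIP_AFTER_RESUME` are hypothetical, and only the count of 3 comes from the commit message.

```python
from __future__ import annotations

from typing import Iterable

SKIP_AFTER_RESUME = 3  # segments to discard after each resume


def filter_segments(segments: Iterable[str]) -> list[str]:
    """Keep recorded segments, dropping paused and stale post-resume ones."""
    mode = "recording"
    skip_remaining = 0
    kept: list[str] = []
    for seg in segments:
        token = seg.strip().lower().strip(".,!?")
        if token == "pause":
            mode = "paused"
            continue
        if token == "resume":
            mode = "recording"
            skip_remaining = SKIP_AFTER_RESUME  # arm the stale-segment skip
            continue
        if mode == "paused":
            continue  # ignored while paused
        if skip_remaining > 0:
            skip_remaining -= 1  # likely a delayed refinement of paused audio
            continue
        kept.append(seg)
    return kept
```

A fixed count is a heuristic: it also discards genuine speech if the user talks immediately after resuming, which is the motivation for the timestamp-based filtering introduced in later commits.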
Record timing information for all state changes:
- Relative time since recording started (seconds)
- Whisper t0 timestamp (milliseconds from whisper-stream)
- Log summary of all state changes at end of capture
This provides visibility into when pause/resume occurred during
the recording session, similar to the Saturday audio intelligence
code that tracked segment timing.
Output example:
State changes during capture:
pause: 15.3s (whisper t0: 15234ms)
resume: 42.7s (whisper t0: 42651ms)
Related to VM-194
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
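A minimal sketch of how state changes with both timestamps could be recorded and summarized in the format shown in the output example above. The `StateChange` and `StateLog` names, the field names, and the two-space indent are assumptions.

```python
from __future__ import annotations

import time
from dataclasses import dataclass, field


@dataclass
class StateChange:
    signal: str          # e.g. "pause" or "resume"
    elapsed_s: float     # seconds since recording started
    whisper_t0_ms: int   # t0 timestamp reported by whisper-stream


@dataclass
class StateLog:
    started_at: float = field(default_factory=time.monotonic)
    changes: list[StateChange] = field(default_factory=list)

    def record(self, signal: str, whisper_t0_ms: int) -> StateChange:
        """Store a state change stamped with time since capture start."""
        change = StateChange(signal, time.monotonic() - self.started_at, whisper_t0_ms)
        self.changes.append(change)
        return change


def format_state_changes(changes: list[StateChange]) -> str:
    """Render the end-of-capture summary."""
    lines = ["State changes during capture:"]
    for c in changes:
        lines.append(f"  {c.signal}: {c.elapsed_s:.1f}s (whisper t0: {c.whisper_t0_ms}ms)")
    return "\n".join(lines)
```

`time.monotonic()` is used instead of wall-clock time so the relative offsets are unaffected by system clock adjustments during a long capture.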
Add debug_output_file parameter to stream_capture that uses
whisper-stream's -f flag to save complete raw output to a file for
analysis.
When stream_mode is enabled in converse, automatically saves to:
~/tasks/projects/voicemode/VM-194_.../test-data/capture_TIMESTAMP.txt
This captures:
- All whisper-stream output including refinements
- START/END markers with t0/t1 timestamps
- Raw transcription lines
- Timing information
Use this data to develop and test the timing-based filtering algorithm
that will properly handle pause/resume with t0 timestamp filtering.
Related to VM-194
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Add infrastructure for processing raw whisper-stream output using
timestamp-based filtering instead of heuristics:
- WhisperSegment dataclass with start/end times and text
- parse_whisper_timestamp: Convert HH:MM:SS.mmm to seconds
- parse_whisper_line: Extract segments from whisper output format
- process_whisper_output: Filter segments using pause/resume ranges
Algorithm (partial implementation):
1. Parse whisper lines into timestamped segments
2. Build paused time ranges from state_changes
3. Separate t=0 (full retranscriptions) from incremental segments
4. Take longest t=0 segment as base
5. Filter incremental segments by pause ranges
TODO: Complete filtering logic and integrate with stream_capture
Related to VM-194
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
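Steps 1, 2 and 5 of the algorithm can be sketched as below; steps 3 and 4 (handling t=0 retranscriptions) are left out because the commit marks that logic as incomplete. The names `WhisperSegment`, `parse_whisper_timestamp` and `parse_whisper_line` come from the commit message, but their signatures are assumptions, as are the `[HH:MM:SS.mmm --> HH:MM:SS.mmm] text` line format, the helper names `build_paused_ranges` and `filter_paused`, and the choice to filter on each segment's start time.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class WhisperSegment:
    start: float  # seconds
    end: float    # seconds
    text: str


def parse_whisper_timestamp(ts: str) -> float:
    """Convert HH:MM:SS.mmm to seconds."""
    hours, minutes, seconds = ts.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# Assumed line format: "[00:00:01.000 --> 00:00:03.000]  some text"
_TS = r"\d{2}:\d{2}:\d{2}\.\d{3}"
_LINE_RE = re.compile(rf"\[({_TS}) --> ({_TS})\]\s*(.*)")


def parse_whisper_line(line: str) -> Optional[WhisperSegment]:
    """Step 1: parse one output line into a segment, or None if no match."""
    m = _LINE_RE.match(line.strip())
    if m is None:
        return None
    return WhisperSegment(
        parse_whisper_timestamp(m.group(1)),
        parse_whisper_timestamp(m.group(2)),
        m.group(3).strip(),
    )


def build_paused_ranges(state_changes: list[tuple[str, float]]) -> list[tuple[float, float]]:
    """Step 2: turn (signal, seconds) events into (pause_start, pause_end) ranges."""
    ranges: list[tuple[float, float]] = []
    pause_start: Optional[float] = None
    for signal, t in state_changes:
        if signal == "pause" and pause_start is None:
            pause_start = t
        elif signal == "resume" and pause_start is not None:
            ranges.append((pause_start, t))
            pause_start = None
    if pause_start is not None:
        ranges.append((pause_start, float("inf")))  # still paused at the end
    return ranges


def filter_paused(segments: list[WhisperSegment], paused: list[tuple[float, float]]) -> list[WhisperSegment]:
    """Step 5: keep segments whose start time is outside every paused range."""
    return [s for s in segments if not any(lo <= s.start < hi for lo, hi in paused)]
```

Unlike the fixed skip count after resume, this decides per segment from its own timestamp, so late-arriving refinements of paused audio are dropped regardless of how many there are.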
Play immediate audio feedback when control phrases detected:
- Pause: Descending tone (chime_end)
- Resume: Ascending tone (chime_start)
- Send/Stop: Double beep
This gives users instant confirmation that their command was
recognized, even though actual segment filtering happens at end of
capture.
Critical for UX - users need to know "pause" worked before speaking
sensitive information.
Also add test_processor.py script and initial process_whisper_output
function with timestamp parsing (WIP - filtering not yet complete).
Related to VM-194
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Summary
Status: Work in Progress / Draft
Implements cassette-deck-style voice control commands (pause, resume, play, send, stop) for hands-free conversation management with whisper-stream.
Use Cases
Architecture
State Machine
Implementation Progress (19 commits)
Core Infrastructure
Control Flow
Deduplication & Quality
Debug & Testing
Bug Fixes
What's Working
Known Issues / TODO
Files Changed
- voice_mode/stream_capture.py: New module for stream capture
- voice_mode/tools/converse.py: Integration with converse tool
- voice_mode/cli.py: CLI support for --stream-mode
- docs/stream-capture-spec.md: Complete specification
- installer/voicemode_install/dependencies.yaml: SDL2 dependencies
Test Plan
Related
Note: This is a draft PR to document work in progress. Not ready for review or merge yet.
🤖 Generated with Claude Code