
Conversation

@ai-cora ai-cora commented Nov 24, 2025

Summary

Status: Work in Progress / Draft

Implements cassette-deck-style voice control commands (pause, resume, play, send, stop) for hands-free conversation management with whisper-stream.

Use Cases

  1. Let Me Finish: Disable VAD silence detection to speak at length without interruption
  2. Privacy Pause: Temporarily stop recording for phone calls or sensitive conversations
  3. Transcription Review: Review what was transcribed before sending to LLM
  4. Hands-Free Control: Manage entire conversation flow using voice commands

Architecture

State Machine

┌─────────────┐
│  RECORDING  │ ←─────┐
└──────┬──────┘       │
       │              │
       │ "pause"      │ "resume"
       │              │
       ▼              │
┌─────────────┐       │
│   PAUSED    │───────┘
└──────┬──────┘
       │
       │ "send", "stop", "play"
       │
       ▼
    [RETURN]

  • RECORDING: Actively capturing and transcribing audio
  • PAUSED: Not recording audio, but whisper-stream still running
  • Terminal commands: "send", "stop", "play" return control to caller
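A minimal sketch of this state machine (names such as `CaptureState` and `next_state` are illustrative, not the module's actual API):

```python
from enum import Enum, auto

class CaptureState(Enum):
    RECORDING = auto()
    PAUSED = auto()

# Commands that return control to the caller.
TERMINAL_COMMANDS = {"send", "stop", "play"}

def next_state(state, command):
    """Apply a control command; return the new state, or None if terminal."""
    if command in TERMINAL_COMMANDS:
        return None  # control returns to the caller
    if state is CaptureState.RECORDING and command == "pause":
        return CaptureState.PAUSED
    if state is CaptureState.PAUSED and command == "resume":
        return CaptureState.RECORDING
    return state  # redundant or unrecognized commands leave the state unchanged
```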

Implementation Progress (19 commits)

Core Infrastructure

  • Add SDL2 dependencies and stream-capture specification
  • Add stream_capture module with control phrase detection
  • Integrate stream_capture with converse tool
  • Add --stream-mode flag to converse CLI command

Control Flow

  • Implement pause/resume state machine
  • Strip all control phrases from transcription
  • Use word boundary matching for control phrase detection
  • Show transcription and command logging at INFO level

Deduplication & Quality

  • Add detailed logging for deduplication process
  • Add --keep 0 and --length parameters to reduce duplicates
  • Skip stale segments after resume to prevent paused content leakage
  • Track and log pause/resume timing with relative timestamps

Debug & Testing

  • Add debug output file option to capture raw whisper-stream data
  • Add timing-based whisper output processor (WIP)
  • Add audio feedback for control commands
  • Add CLI test script for stream_capture
  • Add testing guide for stream mode implementation

Bug Fixes

  • Use binary mode and manual decode for asyncio subprocess
  • Remove unsupported bufsize parameter from asyncio subprocess
  • Strip punctuation from words for control phrase matching

What's Working

  • Basic pause/resume flow control
  • Control phrase detection (send, pause, resume, stop)
  • Deduplication of overlapping whisper segments
  • CLI interface with --stream-mode flag
  • Debug logging and output capture
  • Audio feedback for commands

Known Issues / TODO

  • Timing-based processor marked as WIP
  • Playback feature not fully implemented
  • Need comprehensive integration testing
  • Performance optimization for long sessions
  • Documentation needs completion

Files Changed

  • voice_mode/stream_capture.py: New module for stream capture
  • voice_mode/tools/converse.py: Integration with converse tool
  • voice_mode/cli.py: CLI support for --stream-mode
  • docs/stream-capture-spec.md: Complete specification
  • installer/voicemode_install/dependencies.yaml: SDL2 dependencies
  • Test scripts and documentation

Test Plan

  • Basic CLI test script created
  • Testing guide documented
  • Integration tests with converse tool
  • Edge case testing (rapid state changes, long pauses)
  • Real-world usage testing

Related

  • Parent epic: VM-179 (Conversation flow control)
  • Builds on: feature/let-me-finish branch

Note: This is a draft PR to document work in progress. Not ready for review or merge yet.

🤖 Generated with Claude Code

mbailey and others added 19 commits November 25, 2025 01:14
Add SDL2 library requirements for whisper-stream binary:
- Debian/Ubuntu: libsdl2-dev
- Fedora: SDL2-devel
- macOS: sdl2 (via Homebrew)

Add comprehensive specification for stream-capture with cassette-deck
flow controls (record, pause, resume, play, send, stop) to enable
hands-free voice conversation management.

Related to VM-194: Implement stream-capture with cassette-deck flow controls
Parent epic: VM-179 Conversation flow control

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Implements core stream capture functionality using whisper-stream:
- Real-time audio capture with whisper-stream subprocess
- Control phrase detection (send, pause, resume, play, stop)
- Segment deduplication from overlapping whisper output
- Returns structured dict with text, control signal, and metadata

Based on working Saturday audio intelligence code with additions from
let-me-finish branch. MVP focuses on "send" command for Phase 1.

Features:
- VAD mode (step 0) to avoid duplicate segments
- Async subprocess management with proper cleanup
- Configurable control phrases
- Debug logging for troubleshooting

Related to VM-194: Implement stream-capture with cassette-deck flow controls
Parent epic: VM-179 Conversation flow control

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Add simple command-line test script to verify stream_capture works:
- Checks whisper-stream availability
- Runs stream_capture with configurable max duration
- Shows control phrases and captures audio
- Displays results including control signal detected

Usage:
    python test_stream_capture.py [--max-duration SECONDS]

This allows testing the stream_capture module independently before
integrating with the converse tool.

Related to VM-194

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Add stream_mode parameter to converse tool that enables whisper-stream
based capture with control phrase detection:

- New stream_mode parameter (default: false) for backward compatibility
- Validates whisper-stream availability when stream_mode enabled
- Uses stream_capture() instead of VAD recording when stream_mode=true
- Skips separate STT processing (whisper-stream does it during capture)
- Returns transcription with control signal detection

Usage:
    converse "Hello" stream_mode=true

User can speak and say control phrases:
- "send", "i'm done", "go ahead" - submit text
- "pause", "hold on" - pause recording
- "resume", "continue" - resume recording
- "play back", "repeat" - review transcription
- "stop", "cancel" - discard recording

MVP Phase 1: Implements basic send command detection.
Future phases will add pause/resume/play/stop handling.

Related to VM-194: Implement stream-capture with cassette-deck flow controls
Parent epic: VM-179 Conversation flow control

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Add --stream-mode flag to voicemode converse command to enable
whisper-stream based capture with control phrase detection:

- New --stream-mode flag in Click command decorator
- Pass stream_mode parameter to all converse_fn.fn calls
- Works in both single and continuous conversation modes

Usage:
    voicemode converse --stream-mode
    voicemode converse --stream-mode --continuous

When stream mode is enabled, users can speak with flow control:
- Say "send" or "i'm done" to submit text
- Say "pause" to pause recording
- Say "resume" to continue recording
- Say "play back" to review transcription
- Say "stop" to cancel

Related to VM-194: Implement stream-capture with cassette-deck flow controls

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Add comprehensive testing guide covering:
- Three testing options (standalone, CLI, MCP)
- Expected behavior and control phrases
- Known limitations and next phase plans
- Git status and commit history

This provides clear instructions for validating the Phase 1 MVP
implementation of stream-capture with control phrase detection.

Related to VM-194

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
asyncio.create_subprocess_exec doesn't support the bufsize parameter
that's available in subprocess.Popen. Remove it to fix "bufsize must be 0"
error when starting stream_capture.

The text=True parameter already provides line-based reading which is
sufficient for our needs.

Fixes runtime error in stream_capture module.

Related to VM-194

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
asyncio.create_subprocess_exec with PIPE requires text=False.
Changed to read bytes and decode manually to fix "text must be False" error.

This is the correct pattern for asyncio subprocess communication
when using PIPE for stdout/stderr.

Fixes runtime error in stream_capture module.

Related to VM-194

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
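The bytes-then-decode pattern this commit describes might look like the following sketch (function name hypothetical; asyncio subprocess pipes accept neither `text=True` nor `bufsize`):

```python
import asyncio

async def read_stream_lines(cmd):
    """Launch a subprocess and return its decoded stdout lines.

    asyncio subprocess pipes are bytes-only, so read raw bytes
    from stdout and decode each line manually.
    """
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.DEVNULL,
    )
    lines = []
    while True:
        raw = await proc.stdout.readline()
        if not raw:  # EOF
            break
        lines.append(raw.decode("utf-8", errors="replace").rstrip("\n"))
    await proc.wait()
    return lines
```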
Add proper state machine for pause/resume flow control:
- Track current_mode (recording or paused)
- Pause command switches to paused mode, stops adding segments
- Resume command switches back to recording mode
- Only segments captured in recording mode are added to output

Strip all control phrases from final text:
- Track all control phrases detected during capture
- Remove each control phrase from final text
- Handles pause, resume, send, and other control words

This fixes two issues:
1. Words spoken between "pause" and "resume" are now excluded
2. Control words themselves are stripped from output

Example:
  Input: "Hello world pause secret stuff resume and send"
  Output: "Hello world and"

Related to VM-194

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
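A simplified sketch combining the two fixes above, pause/resume gating plus control-word stripping (function name hypothetical; the real module tracks segments rather than individual words):

```python
CONTROL_PHRASES = {"pause", "resume", "send", "stop"}

def filter_transcript(words):
    """Drop words spoken while paused and strip control words themselves."""
    kept, recording = [], True
    for word in words:
        lower = word.lower()
        if lower == "pause":
            recording = False          # stop adding words
        elif lower == "resume":
            recording = True           # start adding words again
        elif lower in CONTROL_PHRASES:
            pass                       # other control words are stripped
        elif recording:
            kept.append(word)
    return " ".join(kept)
```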
Fix false positive matches where control words appear as substrings:
- "pause" was matching in "unpause", "regime", "unfinished"
- Now uses word boundary detection
- Single-word phrases: split text and match exact words
- Multi-word phrases: match complete phrase

This fixes the issue where saying "unpause" or "resume" was being
detected as "pause" instead of "resume".

Example fixes:
- "unpause" -> now correctly detects as "resume" (unpause)
- "unfinished" -> no longer falsely detected as "pause"
- "regime" -> no longer falsely detected as "pause"

Related to VM-194

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
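One way to implement the word-boundary matching described above is a `\b`-anchored regex, which handles single- and multi-word phrases uniformly (illustrative sketch, not the module's actual code):

```python
import re

def detect_control_phrase(text, phrases):
    """Return the first control phrase found as a whole word/phrase, or None.

    Word-boundary anchors prevent false positives such as "pause"
    matching inside "unpause" or "unfinished".
    """
    lowered = text.lower()
    for phrase in phrases:
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
            return phrase
    return None
```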
Change transcription logging from DEBUG to INFO for better visibility:
- Show whisper-stream command being launched
- Display each transcribed segment as it arrives
- Show when segments are ignored (paused mode)

Visual indicators:
- 📝 Active transcription (being recorded)
- ⏸️  Paused transcription (being ignored)

This provides real-time feedback during stream capture showing
what's being captured vs. ignored during pause periods.

Related to VM-194

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Fix control phrase detection to ignore punctuation:
- "Pause." now correctly matches "pause"
- "send!" matches "send"
- "resume," matches "resume"

Without this, control words with punctuation weren't being detected
as control signals, causing them to be added to the transcription
instead of triggering state changes.

This also addresses the duplication issue visible in real-time output
since duplicates are only deduplicated at the end. The deduplication
function works correctly but operates on final output.

Related to VM-194

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
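The punctuation fix can be as small as normalizing each word before matching, e.g. (hypothetical helper name):

```python
import string

def normalize_word(word):
    """Strip surrounding punctuation and lowercase, for control-phrase matching."""
    return word.strip(string.punctuation).lower()
```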
Add whisper-stream parameters from let-me-finish branch to reduce
overlapping segments:
- --keep 0: Don't keep audio from previous chunks
- --length 30000: Max 30 seconds per chunk

This should reduce (though not eliminate) duplicate segments that
appear during real-time capture. Final deduplication still applied
at the end.

The -t 6 (6 threads) is standard for whisper-stream performance.

Related to VM-194

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Add INFO and DEBUG logging to deduplication to show:
- Number of input segments
- After each deduplication pass
- Which segments were removed as substrings
- Word overlaps that were merged
- Final segment count and word count

This helps diagnose why duplicates appear in final output and
verify the deduplication logic is working correctly.

Related to VM-194

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
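The substring-removal pass being logged here might be sketched as follows (illustrative; the real deduplication also merges word overlaps across segment boundaries):

```python
def dedupe_segments(segments):
    """Remove segments fully contained in another segment.

    Exact duplicates keep their first occurrence; a segment is dropped
    when some longer (or earlier identical) segment contains it.
    """
    kept = []
    for i, seg in enumerate(segments):
        contained = any(
            seg in other and (len(other) > len(seg) or j < i)
            for j, other in enumerate(segments) if j != i
        )
        if not contained:
            kept.append(seg)
    return kept
```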
When transitioning from PAUSED -> RECORDING, whisper-stream may output
delayed refinements of audio from the paused period. These segments
appear AFTER the resume command but contain content from BEFORE.

Solution: Skip the next 3 segments after resume to discard stale
whisper-stream refinements.

This fixes the issue where paused content like "Not get included in
the output" was appearing in the final transcription after resume.

Visual indicators:
- ⏭️  [skipped post-resume] - Discarding stale refinement

Related to VM-194

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
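The skip-after-resume heuristic could be sketched like this (names hypothetical; `events` stands in for the stream of whisper output and state changes):

```python
def filter_post_resume(events, skip_count=3):
    """Drop the first `skip_count` segments after each resume.

    `events` is a sequence of ("segment", text) and ("resume", None)
    tuples; whisper-stream may emit late refinements of paused audio
    just after resume, so those segments are discarded.
    """
    kept, to_skip = [], 0
    for kind, payload in events:
        if kind == "resume":
            to_skip = skip_count          # start discarding stale refinements
        elif kind == "segment":
            if to_skip > 0:
                to_skip -= 1              # stale refinement, skip it
            else:
                kept.append(payload)
    return kept
```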
Record timing information for all state changes:
- Relative time since recording started (seconds)
- Whisper t0 timestamp (milliseconds from whisper-stream)
- Log summary of all state changes at end of capture

This provides visibility into when pause/resume occurred during
the recording session, similar to the Saturday audio intelligence
code that tracked segment timing.

Output example:
  State changes during capture:
    pause: 15.3s (whisper t0: 15234ms)
    resume: 42.7s (whisper t0: 42651ms)

Related to VM-194

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Add debug_output_file parameter to stream_capture that uses whisper-stream's
-f flag to save complete raw output to a file for analysis.

When stream_mode is enabled in converse, automatically saves to:
~/tasks/projects/voicemode/VM-194_.../test-data/capture_TIMESTAMP.txt

This captures:
- All whisper-stream output including refinements
- START/END markers with t0/t1 timestamps
- Raw transcription lines
- Timing information

Use this data to develop and test the timing-based filtering algorithm
that will properly handle pause/resume with t0 timestamp filtering.

Related to VM-194

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Add infrastructure for processing raw whisper-stream output using
timestamp-based filtering instead of heuristics:

- WhisperSegment dataclass with start/end times and text
- parse_whisper_timestamp: Convert HH:MM:SS.mmm to seconds
- parse_whisper_line: Extract segments from whisper output format
- process_whisper_output: Filter segments using pause/resume ranges

Algorithm (partial implementation):
1. Parse whisper lines into timestamped segments
2. Build paused time ranges from state_changes
3. Separate t=0 (full retranscriptions) from incremental segments
4. Take longest t=0 segment as base
5. Filter incremental segments by pause ranges

TODO: Complete filtering logic and integrate with stream_capture

Related to VM-194

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
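The two parsing helpers might look roughly like this; the bracketed `[t0 --> t1]` line format is an assumption based on whisper.cpp's timestamped output, and the actual module may differ:

```python
import re

LINE_RE = re.compile(
    r"\[(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})\]\s*(.*)"
)

def parse_whisper_timestamp(stamp):
    """Convert an HH:MM:SS.mmm timestamp into seconds as a float."""
    hours, minutes, seconds = stamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)

def parse_whisper_line(line):
    """Parse one '[t0 --> t1]  text' line into (start, end, text), or None."""
    m = LINE_RE.match(line.strip())
    if not m:
        return None
    return (parse_whisper_timestamp(m.group(1)),
            parse_whisper_timestamp(m.group(2)),
            m.group(3).strip())
```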
Play immediate audio feedback when control phrases detected:
- Pause: Descending tone (chime_end)
- Resume: Ascending tone (chime_start)
- Send/Stop: Double beep

This gives users instant confirmation that their command was recognized,
even though actual segment filtering happens at end of capture.

Critical for UX - users need to know "pause" worked before speaking
sensitive information.

Also add test_processor.py script and initial process_whisper_output
function with timestamp parsing (WIP - filtering not yet complete).

Related to VM-194

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@ai-cora ai-cora requested a review from mbailey November 24, 2025 14:34