
Add on-device dictation with Whisper#68

Merged
Dimillian merged 17 commits into Dimillian:main from
moehaje:moehaje/dictation
Jan 17, 2026

Conversation

@moehaje
Contributor

@moehaje moehaje commented Jan 16, 2026

What’s in here

  • Adds on-device dictation in the composer (mic button + waveform).
  • Adds Settings for dictation (enable toggle, model picker, preferred language).
  • Models download on demand into app data (no app bundle bloat).
  • Backend capture + Whisper inference + events wired to the UI.
  • CMake check + CI installs for more reliable builds.

Why these decisions

  • I didn’t bundle models: releases stay small and updates stay quick.
  • Auto-detect stays on; preferred language is just a hint.
  • The preferred-language hint only kicks in when it’s close:
    if best_prob > 0.0 && (best_prob - pref_prob) <= 0.30
  • We only insert the final transcript on stop (keeps UI stable and predictable).
  • Added Escape to cancel dictation immediately.
  • Chose a lightweight waveform (no extra UI libs).
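
The preferred-language rule above can be sketched as follows. This is a minimal illustration, not the PR's backend code: the `LanguageProbs` shape, the `pickLanguage` name, and the fallback to the detected language are assumptions; only the `0.30` margin and the comparison come from the description.

```typescript
// Hypothetical shape: language code -> probability from auto-detection.
type LanguageProbs = Record<string, number>;

// Margin quoted in the PR description.
const HINT_MARGIN = 0.30;

// Returns the language to transcribe with, or null if nothing was detected.
function pickLanguage(probs: LanguageProbs, preferred: string | null): string | null {
  let best: string | null = null;
  let bestProb = 0;
  for (const [lang, p] of Object.entries(probs)) {
    if (p > bestProb) {
      best = lang;
      bestProb = p;
    }
  }
  if (preferred === null) return best;

  // A preferred language missing from the detection result counts as 0.
  const prefProb = probs[preferred] ?? 0;

  // The hint only wins when the detected best is not far ahead of it.
  if (bestProb > 0.0 && bestProb - prefProb <= HINT_MARGIN) return preferred;
  return best;
}
```

With these assumptions, a detection of `{ en: 0.55, de: 0.35 }` and a preferred language of `de` resolves to `de`, while `{ en: 0.9, de: 0.05 }` stays on `en`.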

Implementation details

  • Avoid blocking async Tauri commands (use async primitives instead of sync waits).
  • Audio callback stays realtime-safe (no heavy work or event emission there).
  • Level events are throttled (~30ms) so UI stays smooth.
  • Download progress is throttled (~150ms) to avoid spammy updates.
  • Download hardening: temp .partial, timeouts, SHA256 verification, atomic rename.
  • Whisper context cached per model to reduce processing latency.
  • Transcript insertion adds whitespace heuristics to avoid word-sticking.
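
As an illustration of the whitespace heuristics in the last bullet, here is one way insertion at the cursor could avoid word-sticking. The function name `insertTranscript`, its return shape, and the exact punctuation rule are assumptions made for this sketch, not the PR's implementation.

```typescript
interface InsertResult {
  text: string;
  cursor: number;
}

// Inserts a transcript at the cursor, adding a space on either side only
// when the neighbouring character would otherwise stick to the new words.
function insertTranscript(text: string, cursor: number, transcript: string): InsertResult {
  const trimmed = transcript.trim();
  if (trimmed.length === 0) return { text, cursor };

  const before = text.slice(0, cursor);
  const after = text.slice(cursor);

  // Leading space: needed when the text before the cursor ends in a non-space.
  const needsLeading = before.length > 0 && !/\s$/.test(before);
  // Trailing space: skipped before whitespace and closing punctuation.
  const needsTrailing = after.length > 0 && !/^[\s.,;:!?)]/.test(after);

  const piece = (needsLeading ? " " : "") + trimmed + (needsTrailing ? " " : "");
  return { text: before + piece + after, cursor: before.length + piece.length };
}
```

The returned cursor sits right after the inserted text, so typing can continue from there.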

Demo

  • Video:
moehaje-codexmonitor-dictation.mp4

Notes / observations

  • On my machine, Base felt like the best balance (good English results + reasonable processing time).
  • Larger models improved accuracy a bit, but processing time was noticeably slower.

How to test

  1. Open Settings → Dictation → Enable dictation.
  2. Pick a model and download it.
  3. Click the mic button in the composer, speak, then stop.
  4. Confirm transcript inserts at cursor and send works.
  5. Press Escape while listening → dictation cancels immediately.

@Dimillian
Owner

Review notes (from deep review):

  1. Composer Enter key can still send while dictation is listening/processing (send button is disabled but key handler doesn't check). Suggest short-circuiting when dictation is busy to prevent accidental submits.
  2. Microphone stream errors emit DictationEvent::Error but don't reset dictation session state back to idle; UI may stay stuck in listening. Suggest setting session_state = Idle and emitting state update on error.
  3. Audio capture buffer grows unbounded until stop; long dictations can spike memory/CPU (also cloned on stop). Suggest max duration/size cap or ring buffer/chunked streaming.

Refs:

  • src/features/composer/components/Composer.tsx (Enter handler)
  • src-tauri/src/dictation.rs (err_fn in start_capture_thread)
  • src-tauri/src/dictation.rs (audio buffer extend in build_stream)
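
The short-circuit suggested in note 1 could look roughly like this. It is a sketch under assumptions: `shouldSubmitOnKey` and `KeyLike` are hypothetical names, and treating Shift+Enter as a newline is a guess about the composer rather than something stated in the review.

```typescript
type DictationState = "idle" | "listening" | "processing";

// Minimal subset of a keyboard event, so the check is testable without a DOM.
interface KeyLike {
  key: string;
  shiftKey: boolean;
}

// True only when Enter should send the message.
function shouldSubmitOnKey(event: KeyLike, dictationState: DictationState): boolean {
  // Assumption: Shift+Enter inserts a newline and never submits.
  if (event.key !== "Enter" || event.shiftKey) return false;
  // Mirror the disabled send button: no submit while dictation is busy.
  return dictationState === "idle";
}
```

Routing the key handler and the send button through the same predicate keeps the two from drifting apart again.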

@moehaje
Contributor Author

moehaje commented Jan 17, 2026

You are absolutely right! :) I addressed all three points. Fixes on the way!

Prevent Enter submit during dictation.
Stop capture and emit idle state on mic errors.
Limit stored samples to avoid unbounded growth.
Allow configurable modifier hold to start/stop dictation.
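
To illustrate the idea behind "Limit stored samples", here is a small capped buffer. The actual change lives in the Rust backend and its cap and policy are not described in this thread; `CappedSampleBuffer` is a hypothetical name, and dropping samples once the cap is reached (rather than using a ring buffer) is an assumption made for this sketch.

```typescript
// Fixed-capacity sample store: memory is bounded no matter how long capture runs.
class CappedSampleBuffer {
  private readonly samples: Float32Array;
  private length = 0;

  constructor(maxSamples: number) {
    this.samples = new Float32Array(maxSamples);
  }

  // Copies as much of the chunk as fits and returns how many samples were kept.
  push(chunk: Float32Array): number {
    const room = this.samples.length - this.length;
    const accepted = Math.min(room, chunk.length);
    this.samples.set(chunk.subarray(0, accepted), this.length);
    this.length += accepted;
    return accepted;
  }

  get isFull(): boolean {
    return this.length === this.samples.length;
  }

  // Returns a copy of the samples captured so far.
  snapshot(): Float32Array {
    return this.samples.slice(0, this.length);
  }
}
```

A caller could watch `isFull` to stop the session automatically once the cap is hit.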
@moehaje
Contributor Author

moehaje commented Jan 17, 2026

Another 2 edge-case fixes on the way!

Avoid error when no transcript is produced.
Let cancel stop processing and suppress transcript.
@moehaje
Contributor Author

moehaje commented Jan 17, 2026

  1. If the user starts and stops dictation but says nothing, no error should be thrown or rendered.
  2. Allow pressing Escape to cancel while processing as well, not only while listening.
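
Both edge cases can be handled at the point where a finished transcript arrives. The sketch below uses a generation counter to drop results from a cancelled run; `TranscriptGate` and its methods are hypothetical names, and the PR may implement this differently.

```typescript
// Decides whether a finished transcript should still be inserted.
class TranscriptGate {
  private generation = 0;

  // Call when processing starts; keep the ticket until the result arrives.
  begin(): number {
    this.generation += 1;
    return this.generation;
  }

  // Call on Escape: invalidates any ticket handed out earlier.
  cancel(): void {
    this.generation += 1;
  }

  // Returns the text to insert, or null when there is nothing to do.
  accept(ticket: number, transcript: string): string | null {
    // Result belongs to a cancelled or superseded run: suppress it.
    if (ticket !== this.generation) return null;
    // Silence produces an empty transcript: not an error, just a no-op.
    const text = transcript.trim();
    return text.length > 0 ? text : null;
  }
}
```

Because `accept` returns `null` for both cases, the caller needs no error path for "nothing was said" or "the user cancelled".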

@Dimillian
Owner

@codex review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2b5c694098


Comment on lines 538 to 542

    onChange={(event) =>
      void onUpdateAppSettings({
        ...appSettings,
        dictationHoldKey: event.target.value || null,
      })


[P2] Avoid sending null for dictationHoldKey

Selecting “Off” sets dictationHoldKey to null, but the Tauri settings struct expects a non‑optional String (dictation_hold_key in src-tauri/src/types.rs). When the user saves settings with Off selected, update_app_settings will try to deserialize null into a String and fail, so settings updates will error or revert. This only reproduces when Off is chosen; keeping the field a string sentinel (e.g., "") or making the backend field Option<String> would avoid the mismatch.
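
A frontend-side version of the string-sentinel option might look like the sketch below. `HOLD_KEY_OFF`, `toBackendHoldKey`, and `isHoldKeyEnabled` are hypothetical names; the empty-string sentinel is the one the review proposes, and whether the PR ultimately chose it or `Option<String>` is not shown in this thread.

```typescript
// Sentinel meaning "hold-to-dictate is off"; always a string, never null.
const HOLD_KEY_OFF = "";

// Normalizes the <select> value before it is sent to the backend,
// so a non-optional String field always receives a string.
function toBackendHoldKey(selectValue: string | null | undefined): string {
  return selectValue ?? HOLD_KEY_OFF;
}

// Consumers check the sentinel instead of testing for null.
function isHoldKeyEnabled(holdKey: string): boolean {
  return holdKey !== HOLD_KEY_OFF;
}
```

In the handler quoted above, this would replace `event.target.value || null` with `toBackendHoldKey(event.target.value)`.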


src/App.tsx Outdated
Comment on lines 259 to 262

    holdDictationActive.current = false;
    if (dictationState === "listening") {
      void stopDictation();
    }


[P2] Stop hold-to-dictate even if state update lags

The keyup handler only calls stopDictation() when dictationState === "listening". If the user taps and releases the hold key quickly, the backend may still report idle at keyup time, so no stop is sent; when the state flips to listening a moment later, dictation continues even though the key is no longer held. This can leave the mic running unexpectedly; consider tracking the key‑up intent and stopping once listening begins, or canceling immediately on keyup regardless of current state.
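
The first option, remembering the key-up intent and stopping once listening begins, can be sketched as a small controller. `HoldToDictate` and its method names are hypothetical, and this is not the hook that was eventually merged; a fuller version would also clear the pending flag if the start request fails.

```typescript
type DictationState = "idle" | "listening" | "processing";

class HoldToDictate {
  private held = false;
  private startedByHold = false;
  private pendingStop = false;
  private readonly start: () => void;
  private readonly stop: () => void;

  constructor(start: () => void, stop: () => void) {
    this.start = start;
    this.stop = stop;
  }

  onKeyDown(state: DictationState): void {
    if (this.held) return; // ignore key auto-repeat
    this.held = true;
    this.pendingStop = false;
    if (state === "idle") {
      this.startedByHold = true;
      this.start();
    }
  }

  onKeyUp(state: DictationState): void {
    if (!this.held) return;
    this.held = false;
    if (!this.startedByHold) return;
    this.startedByHold = false;
    if (state === "listening") {
      this.stop();
    } else {
      // Backend has not reported "listening" yet: remember the release.
      this.pendingStop = true;
    }
  }

  // Call whenever the backend reports a new dictation state.
  onStateChange(state: DictationState): void {
    if (state === "listening" && this.pendingStop && !this.held) {
      this.pendingStop = false;
      this.stop();
    }
  }
}
```

A quick tap then produces one start followed by one stop as soon as the lagging state update arrives, so the mic does not keep running.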


moehaje and others added 5 commits January 17, 2026 10:26
Avoid null values in app settings and make hold-to-dictate stop on keyup even if state lags.
Move keyboard handling into a dedicated hook and shared key matcher.
@Dimillian Dimillian merged commit b8a86b9 into Dimillian:main Jan 17, 2026
2 checks passed

2 participants