Skip to content

jcleigh/uhura

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Uhura

Local, push-to-talk speech-to-text for macOS. Hold a hotkey, speak, release — the transcript pastes into whatever text field is focused. Everything runs on-device via Apple Neural Engine. Audio never leaves your Mac.

Apple Silicon + macOS 14 (Sonoma) or later required.

Installation

Download the latest Uhura.dmg from Releases, open it, and drag Uhura to Applications.

Because Uhura is ad-hoc signed and not notarized, macOS may say the app is damaged on first launch. If so, run once in Terminal:

xattr -cr /Applications/Uhura.app

First launch: Uhura downloads the Whisper small.en model (~460 MB) from Hugging Face on first run. A progress indicator appears in the menu bar. Subsequent launches load in 1–3 s.

Setup

  1. Launch Uhura — the onboarding window opens automatically.
  2. Grant Microphone access (system prompt).
  3. Grant Accessibility access — click the button, toggle Uhura on in System Settings, and return. The app detects the grant within 1 s; no restart needed.
  4. Record your push-to-talk hotkey (any key or chord).
  5. Click Continue.

Use

  • Focus any text field in any app.
  • Hold your hotkey and speak.
  • A floating HUD shows a live waveform and partial transcript while you hold.
  • Release — the final transcript pastes into the focused field via Cmd+V.

Settings

Menu bar icon → Settings…

Setting Default Notes
Input device System default USB and Bluetooth mics appear when connected
Push-to-talk hotkey (unset) Set in onboarding or here
Whisper model small.en (~460 MB) Switching requires a restart
Start/stop sounds On Subtle Tink/Pop chimes
Use Unicode keystrokes Off Slower fallback that avoids the clipboard; use if Cmd+V paste fails in a specific app

Building from source

Requirements: Xcode 16+, XcodeGen

brew install xcodegen
git clone https://github.com/jcleigh/uhura
cd uhura
xcodegen generate
open Uhura.xcodeproj

The first build resolves SPM packages (WhisperKit + KeyboardShortcuts) — allow a couple of minutes.

To run tests: ⌘U in Xcode, or:

xcodebuild test -project Uhura.xcodeproj -scheme Uhura -destination 'platform=macOS'

Releasing

Push a version tag and GitHub Actions builds and publishes the DMG automatically:

git tag v1.0.0
git push origin v1.0.0

The workflow installs XcodeGen, generates the project, resolves packages, builds a Release binary, ad-hoc signs it, packages Uhura.dmg and Uhura.zip, and creates a GitHub Release. See .github/workflows/build-release.yml.

Architecture

Audio pipeline — WhisperKit's AudioProcessor owns AVAudioEngine capture, 48 kHz → 16 kHz Float32 resampling, sample storage, and relativeEnergy: [Float] for the waveform. The app calls startRecordingLive and stopRecording directly rather than going through AudioStreamTranscriber, which is a Swift actor whose executor has no run loop and causes AVAudioEngine.start() to fail with -10877 (kAudioHardwareBadStreamError). All AVAudioEngine calls are made on @MainActor (main thread), which has a run loop.

Dual-pass transcription — while recording, a detached Task calls whisperKit.transcribe(audioArray:) every 800 ms (after ≥ 1 s of new audio accumulates) to drive the HUD partial text. On key-up, stopAndFinalize() runs a single final transcribe pass over the full captured buffer for the cleanest paste result. Streaming = liveness; single-shot = accuracy.

Paste injectionNSPasteboard snapshot → set transcript → 20 ms settle → synthesized Cmd+V via CGEvent on .cghidEventTap → 150 ms → restore original clipboard. .cghidEventTap (not .cgSessionEventTap) is required for reliable delivery into Electron/Chromium apps.

Permissions — Accessibility is detected by polling AXIsProcessTrusted() at 1 Hz. macOS 13+ applies the grant without an app restart. Microphone permission uses AVAudioApplication.requestRecordPermission() on macOS 14+ (not AVCaptureDevice), which both grants TCC and primes the coreaudiod XPC session needed by AVAudioEngine.

Settings persistence@Observable only tracks stored properties. Settings.swift uses stored properties initialized from UserDefaults with didSet persistence rather than computed properties, ensuring SwiftUI observation picks up every change.

Project layout

Uhura/
├── App/
│   ├── UhuraApp.swift              @main, MenuBarExtra + Settings scene
│   ├── AppDelegate.swift           NSApplicationDelegate, owns AppCoordinator
│   └── AppCoordinator.swift        wires hotkey → engine → HUD → paste
├── Capture/
│   ├── DictationEngine.swift       owns WhisperKit, audio capture, transcription
│   └── Settings.swift              UserDefaults-backed preferences (@Observable)
├── Hotkey/
│   └── HotkeyController.swift      KeyboardShortcuts onKeyDown/onKeyUp
├── Injection/
│   ├── PasteInjector.swift         clipboard save → set → Cmd+V → restore
│   └── AccessibilityPermission.swift  AXIsProcessTrusted + 1 Hz polling
├── HUD/
│   ├── HUDWindowController.swift   NSPanel, .floating, .canJoinAllSpaces
│   ├── HUDView.swift               SwiftUI waveform + partial text overlay
│   └── WaveformView.swift          bar viz of AudioProcessor.relativeEnergy
├── MenuBar/
│   ├── MenuBarIcon.swift           custom Uhura badge template image + state badges
│   └── SettingsScene.swift         SwiftUI Settings + Onboarding views
├── Audio/
│   └── Pings.swift                 NSSound start/stop chimes
└── Resources/
    └── Assets.xcassets/            AppIcon (all sizes), MenuBarIcon template

Known limits

  • Not sandboxed — cannot ship to the Mac App Store. Developer ID + notarization is the supported distribution path.
  • Memory: small.en runs ~800 MB–1.2 GB resident. Use tiny.en or base.en for lower-memory machines (configurable in Settings).
  • Single-modifier hotkeys (e.g. Right Option alone) require macOS 15.2+ and KeyboardShortcuts v2.1.0+. Use a chord on older systems.
  • Apple secure text fields and password manager master password prompts intentionally block synthesized paste — clipboard is still restored cleanly.
  • Sub-250 ms keypresses are silently dropped — Whisper hallucinates on near-empty audio.

License

MIT — see LICENSE.

About

Local LLM-powered Speech-To-Text for macOS

Resources

License

Stars

Watchers

Forks

Contributors

Languages