Most transcription apps use a two-step process: ASR transcription followed by LLM cleanup. Voice Notepad sends audio directly to multimodal AI models that transcribe and format in a single pass.
| Traditional Approach | Voice Notepad |
|---|---|
| Record → ASR → Raw text → LLM → Formatted output | Record → Multimodal AI → Formatted output |
| Two API calls, higher latency | Single API call, faster results |
| AI reads text only | AI "hears" your voice |
The AI hears tone, pauses, and emphasis. Verbal commands like "scratch that" or "new paragraph" work naturally.
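The single-pass flow can be pictured as one request that pairs the formatting prompt with the raw audio. Below is a minimal, stdlib-only sketch of assembling such a payload for the Gemini `generateContent` REST endpoint; the model name, prompt text, and key handling are illustrative, not Voice Notepad's actual client code:

```python
import base64
import json

def build_request(prompt: str, audio_bytes: bytes, mime_type: str = "audio/wav") -> dict:
    """Assemble a single generateContent payload: prompt text plus inline audio."""
    return {
        "contents": [{
            "parts": [
                {"text": prompt},
                {"inline_data": {
                    "mime_type": mime_type,
                    "data": base64.b64encode(audio_bytes).decode("ascii"),
                }},
            ]
        }]
    }

# One request carries both the instructions and the audio -- no separate ASR step.
payload = build_request("Transcribe and clean up this recording.", b"RIFF...")
print(json.dumps(payload)[:80])
```

Because the audio travels inside the same request as the instructions, the model can act on what it hears (pauses, tone, spoken commands) while it formats.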
- Cost-effective — 848 transcriptions for $1.17 (~1.4¢ per 1,000 words)
- Fast — Single API call with local preprocessing
- Smart cleanup — Removes filler words, adds punctuation, formats output
- Global hotkeys — Record from anywhere, even when minimized
- Flexible output — App window, clipboard, or inject directly at cursor
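The cost figure above is simple division; a quick check using the numbers from the list (the implied words-per-recording estimate is my inference, not a measured value):

```python
total_cost_usd = 1.17      # batch cost from the feature list above
transcriptions = 848

cost_each = total_cost_usd / transcriptions           # ~$0.00138, i.e. ~0.14 cents
# At ~1.4 cents per 1,000 words, that implies roughly 100 words per recording
words_per_recording = 1000 * (cost_each * 100) / 1.4
print(f"${cost_each:.5f} per transcription, ~{words_per_recording:.0f} words each")
```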
| Resource | Description |
|---|---|
| User Manual v3 (PDF) | Complete 27-page guide covering installation, configuration, hotkey setup, and troubleshooting |
| Online Documentation | Markdown docs for installation, audio pipeline, cost tracking, and technical reference |
1. Download from Releases (AppImage, .deb, or Windows installer)
2. Add your API key (Google Gemini or OpenRouter)
3. Press Record, speak naturally, press Transcribe
4. Get clean, formatted text
```bash
# Or run from source
git clone https://github.com/danielrosehill/Voice-Notepad.git
cd Voice-Notepad && ./run.sh
```

Voice Notepad combines local preprocessing with cloud transcription for optimal cost and quality.
```mermaid
flowchart LR
    subgraph LOCAL["🖥️ Local Preprocessing"]
        direction LR
        A[🎤 Record<br/>48kHz] --> B[📊 AGC<br/>Normalize]
        B --> C[🔇 VAD<br/>Remove Silence]
        C --> D[📦 Compress<br/>16kHz mono]
    end
    subgraph CLOUD["☁️ Cloud Transcription"]
        direction LR
        E[📝 Prompt<br/>Concatenation] --> F[🤖 Gemini API<br/>Audio + Prompt]
        F --> G[✨ Formatted<br/>Text]
    end
    D --> E
    style LOCAL fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
    style CLOUD fill:#e3f2fd,stroke:#2196f3,stroke-width:2px
```
| Stage | Component | Purpose |
|---|---|---|
| Local | AGC | Normalizes audio levels (target -3 dBFS) |
| Local | VAD | Strips silence — typically 30-80% reduction |
| Local | Compress | Downsamples to 16kHz mono WAV |
| Cloud | Prompt Concatenation | Builds layered instructions |
| Cloud | Gemini API | Single-pass transcription + cleanup |
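The local stages map to straightforward signal operations. A pure-Python sketch on float samples in [-1, 1] follows; the thresholds, frame size, and naive decimation are illustrative simplifications, since the real pipeline uses AGC and the model-based TEN VAD:

```python
import math

def agc_normalize(samples, target_dbfs=-3.0):
    """Scale so the peak sits at the target level (e.g. -3 dBFS)."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)
    gain = 10 ** (target_dbfs / 20) / peak
    return [s * gain for s in samples]

def strip_silence(samples, frame=480, threshold=0.01):
    """Crude energy gate: drop frames whose RMS falls below the threshold.
    (Stands in for the real VAD, which is model-based.)"""
    kept = []
    for i in range(0, len(samples), frame):
        chunk = samples[i:i + frame]
        rms = math.sqrt(sum(s * s for s in chunk) / len(chunk))
        if rms >= threshold:
            kept.extend(chunk)
    return kept

def downsample_3x(samples):
    """48 kHz -> 16 kHz by keeping every third sample (no anti-alias filter)."""
    return samples[::3]
```

Stripping silence before upload is where most of the savings come from: at the 30-80% reduction noted above, the cloud call sees a fraction of the recorded audio.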
Voice Notepad uses a layered prompt architecture where instructions are concatenated at transcription time. This allows flexible, modular control over output formatting.
```mermaid
flowchart TB
    subgraph FOUNDATION["🏗️ Foundation Layer (Always Applied)"]
        F1[Remove filler words]
        F2[Add punctuation]
        F3[Fix grammar & spelling]
        F4[Honor verbal commands]
        F5[Handle background audio]
    end
    subgraph FORMAT["📋 Format Layer"]
        FMT[Email / Todo / Meeting Notes<br/>Blog / Documentation / AI Prompt]
    end
    subgraph STYLE["🎨 Style Layer"]
        S1[Formality<br/>Casual → Professional]
        S2[Verbosity<br/>None → Maximum reduction]
    end
    subgraph PERSONAL["👤 Personalization"]
        P1[Email signatures]
        P2[User name]
    end
    FOUNDATION --> FORMAT
    FORMAT --> STYLE
    STYLE --> PERSONAL
    PERSONAL --> OUTPUT[📤 Final Prompt]
    style FOUNDATION fill:#fff3e0,stroke:#ff9800,stroke-width:2px
    style FORMAT fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px
    style STYLE fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
    style PERSONAL fill:#e3f2fd,stroke:#2196f3,stroke-width:2px
```
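Concatenation at transcription time can be as simple as joining the active layers in order, foundation first. A hypothetical sketch (the layer wording is illustrative, not the app's actual prompts):

```python
FOUNDATION = (
    "Transcribe the audio. Remove filler words, add punctuation, "
    "fix grammar and spelling, and honor verbal commands like 'scratch that'."
)

def build_prompt(format_layer=None, style_layer=None, personalization=None):
    """Stack layers bottom-up: foundation -> format -> style -> personalization."""
    layers = [FOUNDATION, format_layer, style_layer, personalization]
    # Skip unset layers so the prompt stays minimal
    return "\n\n".join(layer for layer in layers if layer)

prompt = build_prompt(
    format_layer="Format the result as a professional email.",
    personalization="Sign off as Daniel.",
)
```

Because each layer is an independent string, any combination can be toggled without rewriting the others.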
Prompt Stacks let you save and combine multiple prompt layers for recurring workflows:
| Stack Example | Layers Combined |
|---|---|
| Meeting Notes + Actions | Foundation + Meeting format + Action item extraction |
| Technical Documentation | Foundation + Doc format + Code extraction + Markdown |
| Quick Email | Foundation + Email format + Professional tone + Signature |
Create custom stacks in the Prompt Stacks tab, then apply them with a single click.
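A saved stack is then just a named list of layers applied on top of the always-on foundation. A minimal illustration (the stack name and layer text are hypothetical):

```python
STACKS = {
    "Quick Email": [
        "Format the result as an email.",
        "Use a professional tone.",
        "Append the saved signature.",
    ],
}

def apply_stack(foundation: str, stack_name: str) -> str:
    """Combine the always-on foundation layer with a saved stack's layers."""
    return "\n\n".join([foundation] + STACKS[stack_name])

print(apply_stack("Transcribe and clean up the audio.", "Quick Email"))
```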
| Provider | Recommended Model | Notes |
|---|---|---|
| Google Gemini | gemini-flash-latest | Direct API, auto-updates to latest Flash model |
| OpenRouter | google/gemini-2.5-flash | Per-key cost tracking, OpenAI-compatible API |
| Component | Technology |
|---|---|
| Transcription | Google Gemini / OpenRouter |
| Voice Activity Detection | TEN VAD |
| Text-to-Speech | Edge TTS |
| Database | Mongita |
| UI Framework | PyQt6 |
See Technology Stack for details.
Real usage from ~2,000 transcriptions shows OpenRouter's Gemini 2.5 Flash delivers 2x faster inference:
| Provider | Model | Avg Inference | Chars/sec |
|---|---|---|---|
| Gemini Direct | gemini-flash-latest | 5.1s | 90 |
| OpenRouter | google/gemini-2.5-flash | 2.5s | 204 |
Anonymized usage data available in data/.
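The Chars/sec column is derived by dividing average output length by average inference time. A quick check consistent with the table (the ~510-character average output is inferred from the OpenRouter row, not a logged figure):

```python
def chars_per_sec(chars: float, seconds: float) -> float:
    """Throughput metric used in the benchmark table."""
    return chars / seconds

# OpenRouter row: ~510 chars of output in 2.5 s averages 204 chars/sec
print(round(chars_per_sec(510, 2.5)))
```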
This software was developed through AI-human collaboration. Code was generated by Claude Opus 4.5 under my direction—I designed the architecture and specified requirements while Claude wrote the implementation.
- Audio-Multimodal-AI-Resources — Curated list of audio-capable multimodal models
- Audio-Understanding-Test-Prompts — Test prompts for evaluating audio understanding
MIT




