Most transcription apps use a two-step process: ASR transcription followed by LLM cleanup. Voice Notepad sends audio directly to multimodal AI models that transcribe and format in a single pass.
| Traditional Approach | Voice Notepad |
|---|---|
| Record → ASR → Raw text → LLM → Formatted output | Record → Multimodal AI → Formatted output |
| Two API calls, higher latency | Single API call, faster results |
| AI reads text only | AI "hears" your voice |
The AI hears tone, pauses, and emphasis. Verbal commands like "scratch that" or "new paragraph" work naturally.
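The single-pass flow can be pictured as one request that pairs the formatting prompt with the raw audio. Below is a minimal, stdlib-only sketch of assembling such a payload for the Gemini `generateContent` REST endpoint; the model name, prompt text, and key handling are illustrative, not Voice Notepad's actual client code:

```python
import base64
import json

def build_request(prompt: str, audio_bytes: bytes, mime_type: str = "audio/wav") -> dict:
    """Assemble a single generateContent payload: prompt text plus inline audio."""
    return {
        "contents": [{
            "parts": [
                {"text": prompt},
                {"inline_data": {
                    "mime_type": mime_type,
                    "data": base64.b64encode(audio_bytes).decode("ascii"),
                }},
            ]
        }]
    }

# One request carries both the instructions and the audio -- no separate ASR step.
payload = build_request("Transcribe and clean up this recording.", b"RIFF...")
print(json.dumps(payload)[:80])
```

Because the audio travels inside the same request as the instructions, the model can act on what it hears (pauses, tone, spoken commands) while it formats.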
- Cost-effective — 848 transcriptions for $1.17 (~1.4¢ per 1,000 words)
- Fast — Single API call with local preprocessing
- Smart cleanup — Removes filler words, adds punctuation, formats output
- Global hotkeys — Record from anywhere, even when minimized
- Flexible output — App window, clipboard, or inject directly at cursor
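The cost figure above is simple division; a quick check using the numbers from the list (the implied words-per-recording estimate is my inference, not a measured value):

```python
total_cost_usd = 1.17      # batch cost from the feature list above
transcriptions = 848

cost_each = total_cost_usd / transcriptions           # ~$0.00138, i.e. ~0.14 cents
# At ~1.4 cents per 1,000 words, that implies roughly 100 words per recording
words_per_recording = 1000 * (cost_each * 100) / 1.4
print(f"${cost_each:.5f} per transcription, ~{words_per_recording:.0f} words each")
```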
| Resource | Description |
|---|---|
| User Manual v3 (PDF) | Complete 27-page guide covering installation, configuration, hotkey setup, and troubleshooting |
| Online Documentation | Markdown docs for installation, audio pipeline, cost tracking, and technical reference |
1. Download from Releases (AppImage, .deb, or Windows installer)
2. Add your API key (Google Gemini or OpenRouter)
3. Press Record, speak naturally, press Transcribe
4. Get clean, formatted text
```bash
# Or run from source
git clone https://github.com/danielrosehill/Voice-Notepad.git
cd Voice-Notepad && ./run.sh
```

Voice Notepad combines local preprocessing with cloud transcription for optimal cost and quality.
```mermaid
flowchart LR
    subgraph LOCAL["🖥️ Local Preprocessing"]
        direction LR
        A[🎤 Record<br/>48kHz] --> B[📊 AGC<br/>Normalize]
        B --> C[🔇 VAD<br/>Remove Silence]
        C --> D[📦 Compress<br/>16kHz mono]
    end
    subgraph CLOUD["☁️ Cloud Transcription"]
        direction LR
        E[📝 Prompt<br/>Concatenation] --> F[🤖 Gemini API<br/>Audio + Prompt]
        F --> G[✨ Formatted<br/>Text]
    end
    D --> E
    style LOCAL fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
    style CLOUD fill:#e3f2fd,stroke:#2196f3,stroke-width:2px
```
| Stage | Component | Purpose |
|---|---|---|
| Local | AGC | Normalizes audio levels (target -3 dBFS) |
| Local | VAD | Strips silence — typically 30-80% reduction |
| Local | Compress | Downsamples to 16kHz mono WAV |
| Cloud | Prompt Concatenation | Builds layered instructions |
| Cloud | Gemini API | Single-pass transcription + cleanup |
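The local stages map to straightforward signal operations. A pure-Python sketch on float samples in [-1, 1] follows; the thresholds, frame size, and naive decimation are illustrative simplifications, since the real pipeline uses AGC and the model-based TEN VAD:

```python
import math

def agc_normalize(samples, target_dbfs=-3.0):
    """Scale so the peak sits at the target level (e.g. -3 dBFS)."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)
    gain = 10 ** (target_dbfs / 20) / peak
    return [s * gain for s in samples]

def strip_silence(samples, frame=480, threshold=0.01):
    """Crude energy gate: drop frames whose RMS falls below the threshold.
    (Stands in for the real VAD, which is model-based.)"""
    kept = []
    for i in range(0, len(samples), frame):
        chunk = samples[i:i + frame]
        rms = math.sqrt(sum(s * s for s in chunk) / len(chunk))
        if rms >= threshold:
            kept.extend(chunk)
    return kept

def downsample_3x(samples):
    """48 kHz -> 16 kHz by keeping every third sample (no anti-alias filter)."""
    return samples[::3]
```

Stripping silence before upload is where most of the savings come from: at the 30-80% reduction noted above, the cloud call sees a fraction of the recorded audio.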
Voice Notepad uses a layered prompt architecture where instructions are concatenated at transcription time. This allows flexible, modular control over output formatting.
```mermaid
flowchart TB
    subgraph FOUNDATION["🏗️ Foundation Layer (Always Applied)"]
        F1[Remove filler words]
        F2[Add punctuation]
        F3[Fix grammar & spelling]
        F4[Honor verbal commands]
        F5[Handle background audio]
    end
    subgraph FORMAT["📋 Format Layer"]
        FMT[Email / Todo / Meeting Notes<br/>Blog / Documentation / AI Prompt]
    end
    subgraph STYLE["🎨 Style Layer"]
        S1[Formality<br/>Casual → Professional]
        S2[Verbosity<br/>None → Maximum reduction]
    end
    subgraph PERSONAL["👤 Personalization"]
        P1[Email signatures]
        P2[User name]
    end
    FOUNDATION --> FORMAT
    FORMAT --> STYLE
    STYLE --> PERSONAL
    PERSONAL --> OUTPUT[📤 Final Prompt]
    style FOUNDATION fill:#fff3e0,stroke:#ff9800,stroke-width:2px
    style FORMAT fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px
    style STYLE fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
    style PERSONAL fill:#e3f2fd,stroke:#2196f3,stroke-width:2px
```
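Concatenation at transcription time can be as simple as joining the active layers in order, foundation first. A hypothetical sketch (the layer wording is illustrative, not the app's actual prompts):

```python
FOUNDATION = (
    "Transcribe the audio. Remove filler words, add punctuation, "
    "fix grammar and spelling, and honor verbal commands like 'scratch that'."
)

def build_prompt(format_layer=None, style_layer=None, personalization=None):
    """Stack layers bottom-up: foundation -> format -> style -> personalization."""
    layers = [FOUNDATION, format_layer, style_layer, personalization]
    # Skip unset layers so the prompt stays minimal
    return "\n\n".join(layer for layer in layers if layer)

prompt = build_prompt(
    format_layer="Format the result as a professional email.",
    personalization="Sign off as Daniel.",
)
```

Because each layer is an independent string, any combination can be toggled without rewriting the others.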
Prompt Stacks let you save and combine multiple prompt layers for recurring workflows:
| Stack Example | Layers Combined |
|---|---|
| Meeting Notes + Actions | Foundation + Meeting format + Action item extraction |
| Technical Documentation | Foundation + Doc format + Code extraction + Markdown |
| Quick Email | Foundation + Email format + Professional tone + Signature |
Create custom stacks in the Prompt Stacks tab, then apply them with a single click.
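A saved stack is then just a named list of layers applied on top of the always-on foundation. A minimal illustration (the stack name and layer text are hypothetical):

```python
STACKS = {
    "Quick Email": [
        "Format the result as an email.",
        "Use a professional tone.",
        "Append the saved signature.",
    ],
}

def apply_stack(foundation: str, stack_name: str) -> str:
    """Combine the always-on foundation layer with a saved stack's layers."""
    return "\n\n".join([foundation] + STACKS[stack_name])

print(apply_stack("Transcribe and clean up the audio.", "Quick Email"))
```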
| Provider | Recommended Model | Notes |
|---|---|---|
| Google Gemini | gemini-flash-latest | Direct API, auto-updates to latest Flash model |
| OpenRouter | google/gemini-2.5-flash | Per-key cost tracking, OpenAI-compatible API |
| Component | Technology |
|---|---|
| Transcription | Google Gemini / OpenRouter |
| Voice Activity Detection | TEN VAD |
| Text-to-Speech | Edge TTS |
| Database | Mongita |
| UI Framework | PyQt6 |
See Technology Stack for details.
Real usage from ~2,000 transcriptions shows OpenRouter's Gemini 2.5 Flash delivers 2x faster inference:
| Provider | Model | Avg Inference | Chars/sec |
|---|---|---|---|
| Gemini Direct | gemini-flash-latest | 5.1s | 90 |
| OpenRouter | google/gemini-2.5-flash | 2.5s | 204 |
Anonymized usage data available in data/.
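The Chars/sec column is derived by dividing average output length by average inference time. A quick check consistent with the table (the ~510-character average output is inferred from the OpenRouter row, not a logged figure):

```python
def chars_per_sec(chars: float, seconds: float) -> float:
    """Throughput metric used in the benchmark table."""
    return chars / seconds

# OpenRouter row: ~510 chars of output in 2.5 s averages 204 chars/sec
print(round(chars_per_sec(510, 2.5)))
```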
This software was developed through AI-human collaboration. Code was generated by Claude Opus 4.5 under my direction—I designed the architecture and specified requirements while Claude wrote the implementation.
- Audio-Multimodal-AI-Resources — Curated list of audio-capable multimodal models
- Audio-Understanding-Test-Prompts — Test prompts for evaluating audio understanding
MIT




