Earu/CorpoDrone

CorpoDrone

Capture audio from your microphone and speakers simultaneously, transcribe with Whisper, identify who's talking with speaker diarization, and summarize the full session with a local LLM, all displayed in a desktop UI.

Legal Notice

CorpoDrone captures and processes audio on your machine. You alone are responsible for how you use it: local laws, consent, notice, workplace rules, and data-protection requirements are on you. The authors and contributors disclaim liability for any use or misuse. This is not legal advice; if in doubt, ask a lawyer or don’t run it.

Features

  • Dual audio capture: mic input and loopback (speaker output) captured simultaneously
  • Real-time transcription: sliding window Whisper transcription with word-level timestamps
  • Speaker diarization: PyAnnote 3.1 assigns speaker labels to each segment
  • Persistent speaker identities: ECAPA-TDNN embeddings match speakers across sessions; enroll names for future recognition
  • Session summary: full audio re-transcribed at session end, then summarized by a local Ollama LLM
  • Live UI: Tauri desktop app with transcript panel, speaker sidebar, and log drawer
  • Apple Silicon acceleration: uses mlx-whisper on M-series Macs for fast on-device transcription via Metal
  • Linux: mic via cpal (ALSA; PipeWire’s ALSA layer works). Loopback records the default output monitor over PulseAudio’s API (libpulse-simple), which PipeWire normally exposes via pipewire-pulse compatibility

Architecture

[Figure: architecture diagram]

IPC: Tauri spawns audio-capture and pipeline.py as child processes. Audio flows over a named pipe (Windows) or POSIX FIFO (macOS / Linux) as framed binary. Transcript segments and commands flow as JSON lines over a second pipe and stdin/stdout.
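The IPC description above mentions framed binary audio and JSON lines but does not spell out the wire format. The following is a minimal sketch of one common way to do both, not the project's actual protocol: the 4-byte little-endian length prefix, the function names, and the message fields (`speaker`, `text`) are all assumptions made for illustration.

```python
import io
import json
import struct
from typing import BinaryIO, Iterator

# Hypothetical frame layout (an assumption, not taken from the project):
# a 4-byte little-endian payload length followed by the payload bytes.
HEADER = struct.Struct("<I")


def write_frame(stream: BinaryIO, payload: bytes) -> None:
    """Write one length-prefixed frame to the stream."""
    stream.write(HEADER.pack(len(payload)))
    stream.write(payload)


def _read_exact(stream: BinaryIO, n: int) -> bytes:
    """Read up to n bytes, looping because pipe reads may return short."""
    chunks = []
    remaining = n
    while remaining:
        chunk = stream.read(remaining)
        if not chunk:  # writer closed the pipe
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def read_frames(stream: BinaryIO) -> Iterator[bytes]:
    """Yield payloads until the stream ends."""
    while True:
        header = _read_exact(stream, HEADER.size)
        if len(header) < HEADER.size:
            return  # end of stream (or a truncated header)
        (length,) = HEADER.unpack(header)
        payload = _read_exact(stream, length)
        if len(payload) < length:
            raise EOFError("stream closed in the middle of a frame")
        yield payload


def encode_json_line(message: dict) -> bytes:
    """Serialize one message as a single newline-terminated JSON line."""
    return (json.dumps(message) + "\n").encode("utf-8")


def decode_json_lines(data: bytes) -> list[dict]:
    """Parse newline-delimited JSON, skipping blank lines."""
    return [json.loads(line) for line in data.decode("utf-8").splitlines() if line.strip()]
```

In this sketch an `io.BytesIO` buffer can stand in for the pipe during testing. A real named pipe or FIFO would be opened in binary mode and passed to the same functions.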

Loopback source selection (UI): On macOS, starting a recording opens an app picker so you can include per-app ScreenCaptureKit streams (e.g. Discord) alongside the display mix. On Windows and Linux, loopback is the full desktop mix only; there is no picker.

Requirements

  • Windows 10/11, macOS (Apple Silicon recommended), or 64-bit Linux (glibc; typical desktop with PulseAudio or PipeWire+pulse compat)
  • Rust + cargo
  • Python 3.11 or 3.12 (3.13 is not supported by parts of the WhisperX / pyannote stack yet)
  • Node.js (for Tauri CLI)
  • Ollama running locally with a model matching your config.toml (e.g. ollama pull mistral)
  • A HuggingFace account with access accepted for both:

macOS additional requirements

  • Screen Recording permission granted to your terminal app (for loopback capture via ScreenCaptureKit)
  • Microphone permission granted to your terminal app

Linux additional requirements

The Tauri UI depends on the usual WebKitGTK + GTK 3 stack (Tauri Linux prerequisites). Full desktop installations typically include those libraries; minimal or server images and containers often do not, which surfaces as missing shared libraries when building or running prebuilt binaries.

Building from source

Debian/Ubuntu-style packages:

  • libwebkit2gtk-4.1-dev, libgtk-3-dev, libayatana-appindicator3-dev, librsvg2-dev, patchelf, libssl-dev, pkg-config, build-essential

audio-capture:

  • libasound2-dev (ALSA) — mic capture via cpal
  • libpulse-dev — loopback via libpulse-simple (PipeWire works when the PulseAudio compatibility layer and pactl are available)

Python pipeline: system libsndfile (e.g. libsndfile1) for soundfile, and ffmpeg on PATH where the Whisper stack requires it.

Runtime libraries (prebuilt binaries)

Install WebKitGTK for the Tauri webview (GTK/Cairo/GLib and related libs are pulled in as dependencies). PulseAudio and ALSA client libraries are listed explicitly for audio-capture.

  • Debian / Ubuntu: sudo apt install libwebkit2gtk-4.1-0 libpulse0 libasound2
  • Fedora: sudo dnf install webkitgtk4.1 pulseaudio-libs alsa-lib
  • Arch Linux: sudo pacman -S webkitgtk-4.1 libpulse alsa-lib

Package names vary by release; adjust as needed.

Setup

1. Clone

git clone https://github.com/your-username/CorpoDrone
cd CorpoDrone

2. Configure

Edit config.toml to adjust the Whisper model size, speaker limits, Ollama model, etc.

3. Build and run

cargo tauri dev

On first launch, the app opens a setup wizard that performs the following steps:

  • Detects Python 3.11 or 3.12 on your PATH
  • Creates .venv and installs PyTorch (CUDA on Windows, CPU on Linux, CPU/MPS + mlx-whisper on macOS) and all pipeline dependencies
  • Prompts for your HuggingFace token (required for speaker diarization)
  • Checks whether Ollama is installed

4. Pull Ollama model

ollama pull mistral

5. Production build

cargo tauri build

Configuration

config.toml at the project root controls all runtime behavior:

| Key | Default | Description |
|---|---|---|
| python.whisper_model | small | Whisper model size (tiny / base / small / medium / large-v3) |
| python.diarize | true | Enable speaker diarization (requires HF token) |
| python.min_speakers | 1 | Minimum expected speakers |
| python.max_speakers | 8 | Maximum expected speakers |
| python.window_seconds | 20.0 | Sliding window length for real-time transcription |
| python.step_seconds | 3.0 | How often to process a new window |
| python.speech_gate_enabled | true | Skip Whisper on silent windows (RMS + Silero VAD; reduces silence hallucinations). Settings exposes presets (“Standard”, “Stricter”, “Softer”) and optional expert fields |
| python.speech_gate_rms_db_floor | -50.0 | Fast path: below this RMS (dBFS) → no transcription |
| python.speech_gate_min_speech_fraction | 0.12 | Silero: minimum fraction of the window labeled speech |
| python.speech_gate_silero_threshold | 0.5 | Silero speech probability threshold (higher = stricter) |
| python.summarize | true | Generate LLM summary at session end |
| python.ollama_model | mistral | Ollama model for summarization |
| python.ollama_host | http://localhost:11434 | Ollama API endpoint |
| server.python_exe | .venv/Scripts/python.exe (Win) / .venv/bin/python (Unix) | Python interpreter path |
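To make the interaction between window_seconds, step_seconds, and speech_gate_rms_db_floor more concrete, here is a rough sketch of how a sliding window and an RMS fast path could work with the defaults listed above. This is an illustration under assumptions, not the pipeline's actual code: the function names are hypothetical, samples are assumed to be floats in [-1.0, 1.0], and the Silero VAD stage is left out entirely.

```python
import math
from typing import Iterator, Sequence


def window_bounds(
    total_seconds: float,
    window_seconds: float = 20.0,
    step_seconds: float = 3.0,
) -> Iterator[tuple[float, float]]:
    """Yield (start, end) in seconds for each window.

    A new window is emitted every step_seconds and covers at most the
    last window_seconds of audio, so consecutive windows overlap.
    """
    end = step_seconds
    while end <= total_seconds + 1e-9:
        yield (max(0.0, end - window_seconds), end)
        end += step_seconds


def rms_dbfs(samples: Sequence[float]) -> float:
    """RMS level in dBFS for float samples in [-1.0, 1.0]."""
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")  # digital silence
    return 10.0 * math.log10(mean_square)


def passes_rms_gate(samples: Sequence[float], floor_db: float = -50.0) -> bool:
    """Return False when the window is quieter than the floor."""
    return rms_dbfs(samples) >= floor_db
```

With the defaults, 24 seconds of audio produces eight windows, the last one spanning seconds 4 to 24. A window that fails the RMS check would be skipped before any heavier speech detection or transcription runs.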

Speaker Database

Speaker embeddings are stored in speakers_db.json. When a detected speaker's embedding does not match any known profile (cosine similarity below 0.58), that speaker gets a temporary label. At session end, you can enroll them with a name, which is then applied automatically in future sessions.

To reset the database, delete speakers_db.json.
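The matching rule described in this section can be sketched as follows. This is a simplified illustration, not the project's implementation: the in-memory dictionary of name to embedding is a hypothetical stand-in for whatever speakers_db.json actually contains, and real ECAPA-TDNN embeddings have far more than two dimensions.

```python
import math
from typing import Optional, Sequence

# Threshold taken from the text above; everything else is illustrative.
MATCH_THRESHOLD = 0.58


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def match_speaker(
    embedding: Sequence[float],
    known: dict[str, Sequence[float]],
    threshold: float = MATCH_THRESHOLD,
) -> Optional[str]:
    """Return the best-matching enrolled name, or None if below threshold."""
    best_name: Optional[str] = None
    best_score = threshold
    for name, reference in known.items():
        score = cosine_similarity(embedding, reference)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name
```

A `None` result corresponds to the case where the speaker would receive a temporary label until enrolled.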

Tech Stack

| Component | Technology |
|---|---|
| Desktop framework | Tauri 2 (Rust) |
| Audio capture (Windows) | WASAPI via wasapi crate |
| Audio capture (macOS) | ScreenCaptureKit (loopback) + cpal / CoreAudio (mic) |
| Audio capture (Linux) | cpal / ALSA (mic) + PulseAudio simple API / libpulse-simple (default sink monitor loopback; PipeWire via pulse compat) |
| Audio resampling | Rubato |
| Transcription (Apple Silicon) | mlx-whisper (runs on Metal via MLX) |
| Transcription (other) | faster-whisper + WhisperX |
| Live speech gate | RMS prefilter + Silero VAD via torch.hub (before Whisper on all platforms) |
| Diarization | PyAnnote 3.1 |
| Speaker embeddings | SpeechBrain ECAPA-TDNN |
| Summarization | Ollama |
| Structured logging | structlog (Python) + tracing (Rust) |
| Frontend | Vanilla JS / CSS |